Compare commits

337 Commits

Author SHA1 Message Date
385584b712 Add changelog notes for release 2018-04-29 12:04:44 -07:00
731a6ec23a Update nginx-ingress from 0.13.0 to 0.14.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.14.0
2018-04-28 13:10:03 -07:00
e889430926 Update kube-dns from v1.14.9 to v1.14.10
* https://github.com/kubernetes/kubernetes/pull/62676
2018-04-28 00:43:09 -07:00
d81a091756 Switch Atomic docs to reference v1.10.2 tag 2018-04-28 00:27:23 -07:00
32ddfa94e1 Update Kubernetes from v1.10.1 to v1.10.2
* https://github.com/kubernetes/kubernetes/releases/tag/v1.10.2
2018-04-28 00:27:00 -07:00
681450aa0d Update etcd from v3.3.3 to v3.3.4
* https://github.com/coreos/etcd/releases/tag/v3.3.4
2018-04-27 23:57:26 -07:00
fafa028052 Add Typhoon for Fedora Atomic to changelog 2018-04-27 23:55:59 -07:00
86e5adf348 Set commit hash so tutorials work right now
* These modules are alpha, anyone wanting to try them
is probably fine using the raw sha
2018-04-26 09:08:06 -07:00
a89f25e31a Fix typo in announcement 2018-04-26 08:36:50 -07:00
2e4bf4d7ae Add Fedora Atomic announcement and improve docs 2018-04-26 08:18:39 -07:00
b6a51d0b68 Add architecture docs on operating systems 2018-04-25 22:59:48 -07:00
567e18f015 Fix conflict between Calico and NetworkManager
* Observed frequent kube-scheduler and controller-manager
restarts with Calico as the CNI provider. Root cause was
unclear since control plane was functional and tests of
pod to pod network connectivity passed
* Root cause: Calico sets up cali* and tunl* network interfaces
for containers on hosts. NetworkManager tries to manage these
interfaces. It periodically disconnected veth pairs. Logs did
not surface this issue since it's not an error per se, just Calico
and NetworkManager dueling for control. Kubernetes correctly
restarted pods failing health checks and ensured 2 replicas were
running so the control plane functioned mostly normally. Pod to
pod connectivity was only affected occasionally. Pain to debug.
* Solution: Configure NetworkManager to ignore the Calico ifaces
per Calico's recommendation. Cloud-init writes files after
NetworkManager starts, so a restart is required on first boot. On
subsequent boots, the file is present so no restart is needed
2018-04-25 21:45:58 -07:00
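The fix described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the cloud-init content Typhoon actually writes: the function names are hypothetical, and the sketch assumes the `unmanaged-devices` key under `[keyfile]`, which is the NetworkManager setting Calico's documentation recommends for this conflict.

```python
from fnmatch import fnmatch

# Interface name patterns taken from the commit message (cali* and tunl*).
CALICO_PATTERNS = ["cali*", "tunl*"]


def render_nm_conf(patterns):
    """Render a NetworkManager config fragment marking interfaces as unmanaged.

    Hypothetical helper: the real file is written by cloud-init on first boot.
    """
    devices = ";".join(f"interface-name:{p}" for p in patterns)
    return f"[keyfile]\nunmanaged-devices={devices}\n"


def is_unmanaged(iface, patterns=CALICO_PATTERNS):
    """Return True if the interface name matches one of the ignored patterns."""
    return any(fnmatch(iface, p) for p in patterns)


if __name__ == "__main__":
    print(render_nm_conf(CALICO_PATTERNS))
```

Because the file is written after NetworkManager has started on first boot, a real provisioning step would also need to restart NetworkManager once, as the commit notes.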
0a7fab56e2 Load ip_vs kernel module on boot as workaround
* (containerized) kube-proxy warns that it is unable to
load the ip_vs kernel module despite having the correct
mounts. Atomic uses an xz compressed module and modprobe
in the container was not compiled with compression support
* Workaround issue for now by always loading ip_vs on-host
* https://github.com/kubernetes/kubernetes/issues/60
2018-04-25 21:45:58 -07:00
d784b0fca6 Switch to quay.io/poseidon tagged system containers 2018-04-25 18:15:18 -07:00
cd913986df Write documentation for Fedora Atomic 2018-04-24 01:10:27 -07:00
af54efec28 Organize docs by operating system 2018-04-23 19:55:28 -07:00
7198b9016c Update Calico from v3.0.4 to v3.1.1 for Atomic 2018-04-21 18:46:56 -07:00
f36c890234 Fix ostree repo to be called fedora-atomic on bare-metal
* atomic host updates were fetching updates from the repo cache
fedora-atomic-27, instead of from upstream
2018-04-21 18:46:56 -07:00
233ec6dcb0 Update Fedora Atomic AMI to version 27.122
* http://www.projectatomic.io/blog/2018/04/fedora-atomic-20-apr-18/
* Atomic publishes nightly AMIs which sometimes don't boot
or have issues. Until there is a source of reliable AMIs,
pin the best known working AMI
* Rel 66a66f0d18544591ffdbf8fae9df790113c93d72
2018-04-21 18:46:56 -07:00
3f2978821b Add atomic_assets_endpoint var for fedora-atomic bare-metal 2018-04-21 18:46:56 -07:00
9b88d4bbfd Use bootkube system container on fedora-atomic
* Use the upstream bootkube image packaged with the
required metadata to be usable as a system container
under systemd
* Run bootkube with runc so no host level components
use Docker any more. Docker is still the runtime
* Remove bootkube script and old systemd unit
2018-04-21 18:46:56 -07:00
3dde4ba8ba Mount host's /etc/os-release in kubelet system containers
* Fix `kubectl describe node` to reflect the host's operating
system
2018-04-21 18:46:56 -07:00
e148552220 Enable kubelet allocatable enforcement and QoS cgroup hierarchy
* Change kubelet system image to use --cgroups-per-qos=true
(default) instead of false
* Change kubelet system image to use --enforce-node-allocatable=pods
instead of an empty string
2018-04-21 18:46:56 -07:00
d8d1468f03 Update kubelet system container image to mount /etc/hosts
* Fix kubelet port-forward on Google Cloud / Fedora Atomic
* Mount the host's /etc/hosts in kubelet system containers
* Problem: kubelet runc system containers on Atomic were not
mounting the host's /etc/hosts, like rkt-fly does on Container
Linux. `kubectl port-forward` calls socat with localhost. DNS
servers on AWS, DO, and in many bare-metal environments resolve
localhost to the caller as a convenience. Google Cloud notably
does not nor is it required to do so and this surfaced the
missing /etc/hosts in runc kubelet namespaces.
2018-04-21 18:46:56 -07:00
2b74aba564 Add Google Cloud fedora-atomic module
* Network load balancer for ingress doesn't work yet
because Compute Engine packages are missing
* port-forward / socat is broken
2018-04-21 18:46:56 -07:00
24d230505a Add cloud-metadata.service on AWS fedora-atomic 2018-04-21 18:46:56 -07:00
cf22e70b46 Name ostree remote repo fedora-atomic across platforms 2018-04-21 18:46:56 -07:00
b3cf9508b6 Update Fedora Atomic modules to Kubernetes v1.10.1 2018-04-21 18:46:56 -07:00
5212684472 Temporarily pin Fedora Atomic AMI
* Atomic has published AMI images that shutdown
immediately after being powered on
2018-04-21 18:46:56 -07:00
f990473cde Update control plane manifests and add etcd metrics
* Enable etcd v3.3 metrics to expose metrics for
scraping by Prometheus
* Use k8s.gcr.io instead of gcr.io/google_containers
* Add flexvolume plugin mount to controller manager
* Update kube-dns from v1.14.8 to v1.14.9
2018-04-21 18:46:56 -07:00
8523a086e2 Fix kubelet system container to mount CNI plugins
* Mount /opt/cni/bin in kubelet system container so
CNI plugin binaries can be found. Before, flannel
worked because the kubelet falls back to flannel
plugin baked into the hyperkube (undesired)
* Move the CNI bin install location later, since /opt
changes may be lost between ostree rebases
2018-04-21 18:46:56 -07:00
19bc5aea9e Use kubelet system container on fedora-atomic
* Use the upstream hyperkube image packaged with the
required metadata to be usable as a system container
under systemd
* Fix port-forward since socat is included
2018-04-21 18:46:56 -07:00
8d7cfc1a45 Use etcd system container on fedora-atomic
* Use the upstream etcd image packaged with the required
metadata to be usable as a system container (runc) under
systemd
2018-04-21 18:46:56 -07:00
9969c357da Change AWS Fedora module to fedora-atomic 2018-04-21 18:46:56 -07:00
4e43b2ff48 Change DO Fedora module to fedora-atomic 2018-04-21 18:46:56 -07:00
ddc75e99ac Add bare-metal Fedora Atomic module
* Several known hacks and broken areas
* Download v1.10 Kubelet from release tarball
* Install flannel CNI binaries to /opt/cni
* Switch SELinux to Permissive
* Disable firewalld service
* port-forward won't work, socat missing
2018-04-21 18:46:56 -07:00
b80a2eb8a0 Sync fedora-cloud modules with Container Linux
* Update manifests for Kubernetes v1.10.0
* Update etcd from v3.3.2 to v3.3.3
* Add disk_type optional variable on AWS
* Remove redundant kubeconfig copy on AWS
* Distribute etcd secrets only to controllers
* Organize module variables and ssh steps
2018-04-21 18:46:56 -07:00
3610da8b71 Add fedora-cloud module for AWS 2018-04-21 18:46:56 -07:00
485586e5d8 Add fedora-cloud module for Digital Ocean 2018-04-21 18:46:56 -07:00
a54f76db2a Update Calico from v3.0.4 to v3.1.1
* https://github.com/projectcalico/calico/releases/tag/v3.1.1
* https://github.com/projectcalico/calico/releases/tag/v3.1.0
2018-04-21 18:30:36 -07:00
e0d9e9979c Update nginx-ingress from 0.12.0 to 0.13.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.13.0
2018-04-18 21:12:09 -07:00
ad2e4311d1 Switch GCP network lb to global TCP proxy lb
* Allow multi-controller clusters on Google Cloud
* GCP regional network load balancers have a long open
bug in which requests originating from a backend instance
are routed to the instance itself, regardless of whether
the health check passes or not. As a result, only the 0th
controller node registers. We've recommended just using
single master GCP clusters for a while
* https://issuetracker.google.com/issues/67366622
* Workaround issue by switching to a GCP TCP Proxy load
balancer. TCP proxy lb routes traffic to a backend service
(global) of instance group backends. In our case, spread
controllers across 3 zones (all regions have 3+ zones) and
organize them in 3 zonal unmanaged instance groups that
serve as backends. Allows multi-controller cluster creation
* GCP network load balancers only allowed legacy HTTP health
checks so kubelet 10255 was checked as an approximation of
controller health. Replace with TCP apiserver health checks
to detect unhealthy or unresponsive apiservers.
* Drawbacks: GCP provision time increases, tailed logs now
timeout (similar tradeoff in AWS), controllers only span 3
zones instead of the exact number in the region
* Workaround in Typhoon has been known and posted for 5 months,
but there still appears to be no better alternative. It's
probably time to support multi-master and accept the downsides
2018-04-18 00:09:06 -07:00
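The zonal layout described in the commit above (controllers spread across 3 zones, one unmanaged instance group per zone) might be pictured with the small Python sketch below. It is only an illustration under assumptions: the commit does not say how controllers are mapped to zones, so round-robin assignment, the `controller-N` names, and the zone names in the usage note are all hypothetical.

```python
def assign_zones(controller_count, zones):
    """Spread controllers round-robin across at most three zones.

    Returns a dict mapping each used zone to its controller names, which
    mirrors one zonal instance group per zone. Assumed layout, not Typhoon's
    actual Terraform logic.
    """
    used = zones[:3]  # only three zones are used, even if the region has more
    groups = {zone: [] for zone in used}
    for i in range(controller_count):
        groups[used[i % len(used)]].append(f"controller-{i}")
    return groups


if __name__ == "__main__":
    print(assign_zones(5, ["zone-a", "zone-b", "zone-c", "zone-d"]))
```

With five controllers and four available zones, the fourth zone stays unused, which matches the drawback noted in the commit that controllers only span 3 zones.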
490b628e2d Use relative image links to appear in GitHub markdown 2018-04-17 23:40:58 -07:00
23a8156bdf Fix a few typos in comments 2018-04-15 17:21:49 -07:00
9789881243 Update kube-state-metrics from v1.3.0 to v1.3.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.3.1
2018-04-15 17:10:02 -07:00
77c0a4cf2e Update Kubernetes from v1.10.0 to v1.10.1
* Use kubernetes-incubator/bootkube v0.12.0
2018-04-12 20:57:31 -07:00
5035d56db2 Refactor GCP to remove controller internal module
* Remove the controller internal module to align with
other platforms and since it's not a supported use case
2018-04-12 19:41:51 -07:00
9bb3de5327 Skip creating unused dirs on worker nodes 2018-04-11 22:23:51 -07:00
c8eabc2af4 Fix GCP controller_type and worker_type vars 2018-04-11 22:19:58 -07:00
2eaf858c5c Update example BGPPeer manifest
Previous example may have been outdated. It resulted in `error: unable to recognize "example.yaml": no matches for /, Kind=bgpPeer`.

See https://docs.projectcalico.org/v3.0/reference/calicoctl/resources/bgppeer.
2018-04-09 23:23:18 -05:00
b8656fd74b Clarify bare-metal SSH instructions 2018-04-08 14:11:05 -07:00
d276fffcda Fix bare-metal multiple apply/ssh on Terraform v0.11.4+
* Terraform v0.11.4 introduced changes to remote-exec
that mean Typhoon bare-metal clusters require multiple
runs of terraform apply to ssh and bootstrap.
* Bare-metal installs PXE boot a live instance to install
to disk and then reboot from disk as controllers/workers.
Terraform remote-exec has no way to "know" to wait until
the reboot has occurred to kickoff Kubernetes bootstrap.
Previously Typhoon created a "debug" user during this
install phase to allow an admin to SSH, but remote-exec
would hang, trying to connect as user "core". Terraform
v0.11.4 changes this behavior so remote-exec fails and
a user must re-run terraform apply until succeeding.
* A new way to "trick" remote-exec into waiting for the
reboot into the disk install is to run SSH on a non-standard
port during the disk install. This retains the ability
for an admin to SSH during install (most distros don't have
this) and fixes the issue so only a single run of terraform
apply is needed.
* https://github.com/hashicorp/terraform/pull/17359#issuecomment-376415464
2018-04-08 13:32:31 -07:00
6b08bde479 Use k8s.gcr.io instead of gcr.io/google_containers
* Kubernetes recommends using the alias to fetch images
from the nearest GCR regional mirror, to abstract the use
of GCR, and to drop names containing 'google'
* https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ
2018-04-08 12:57:52 -07:00
f4b2396718 Return Prometheus deployment to be a worker workload
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to https://github.com/poseidon/typhoon/pull/175
2018-04-08 12:20:00 -07:00
b76126db93 Update docs builder and material theme 2018-04-08 00:00:03 -07:00
7186aa46da Update kube-state-metrics from v1.2.0 to v1.3.0
* https://github.com/kubernetes/kube-state-metrics/pull/412
* https://github.com/kubernetes/kube-state-metrics/pull/413
2018-04-04 21:04:13 -07:00
18dbaf74ce Update kube-dns from v1.14.8 to v1.14.9
* https://github.com/kubernetes/kubernetes/pull/61908
2018-04-04 21:00:23 -07:00
ce001e9d56 Update etcd from v3.3.2 to v3.3.3
* https://github.com/coreos/etcd/releases/tag/v3.3.3
2018-04-04 20:32:24 -07:00
d770393dbc Add etcd metrics, Prometheus scrapes, and Grafana dash
* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
2018-04-03 20:31:00 -07:00
642f7ec22f Update CHANGES.md with Kubernetes link 2018-03-30 23:12:38 -07:00
1cc043d1eb Update Kubernetes from v1.9.6 to v1.10.0 2018-03-30 22:14:07 -07:00
f8e9bfb1c0 Add disk_type variable for EBS volume type on AWS
* Change EBS volume type from `standard` ("prior generation")
to `gp2`. Prometheus alerts are tuned for SSDs
* Other platforms have fast enough disks by default
2018-03-29 22:51:54 -07:00
b1e41dcb99 addons: Update from Grafana v4.6.3 to v5.0.4
This reverts commit c59a9c66b1.
2018-03-28 19:45:19 -07:00
de4d90750e Use consistent naming of remote provision steps 2018-03-26 00:29:57 -07:00
7acd4931f6 Remove redundant kubeconfig copy on AWS and GCP
* AWS and Google Cloud make use of auto-scaling groups
and managed instance groups, respectively. As such, the
kubeconfig is already held in cloud user-data
* Controller instances are provisioned with a kubeconfig
from user-data. It's redundant to use a Terraform remote
file copy step for the kubeconfig.
2018-03-26 00:01:47 -07:00
cfd603bea2 Ensure etcd secrets are only distributed to controller hosts
* Previously, etcd secrets were erroneously distributed to worker
nodes (permissions 500, ownership etcd:etcd).
2018-03-25 23:46:44 -07:00
fdb543e834 Add optional controller_type and worker_type vars on GCP
* Remove optional machine_type variable on Google Cloud
* Use controller_type and worker_type instead
2018-03-25 22:11:18 -07:00
8d3d4220fd Add disk_size variable on Google Cloud 2018-03-25 22:04:14 -07:00
ba9daf439e Remove unmaintained pxe-worker internal module 2018-03-25 21:57:52 -07:00
38adb14bd2 Remove optional variable networking on Digital Ocean
* Calico isn't viable on Digital Ocean because their firewalls
do not support IP-IP protocol. It's not viable to run a cluster
without firewalls just to use Calico.
* Remove the caveat note. Don't allow users to shoot themselves
in the foot
2018-03-25 21:48:51 -07:00
e43cf9f608 Organize and cleanup variable descriptions 2018-03-25 21:44:43 -07:00
455a4af27e Improve cluster definition examples in docs 2018-03-25 20:41:52 -07:00
39876e455f Fix docs to reflect enforced provider versions 2018-03-25 11:34:39 -07:00
da2be86e8c Add v1.9.6 heading to CHANGES.md 2018-03-22 22:01:29 -07:00
65a2751f77 addons: Update heapster from v1.5.1 to v1.5.2
* https://github.com/kubernetes/heapster/releases/tag/v1.5.2
2018-03-21 20:32:01 -07:00
a04ef3919a Update Kubernetes from v1.9.5 to v1.9.6 2018-03-21 20:29:52 -07:00
851bc1a3f8 Update nginx-ingress from 0.11.0 to 0.12.0 2018-03-19 23:17:17 -07:00
758c09fa5c Update Kubernetes from v1.9.4 to v1.9.5 2018-03-19 00:25:44 -07:00
b1cdd361ef Mention controllers node label in changelog 2018-03-19 00:15:56 -07:00
7f7bc960a6 Set default Google Cloud os_image to coreos-stable 2018-03-19 00:08:26 -07:00
29108fd99d Improve changelog with migration links 2018-03-18 23:54:55 -07:00
18d08de898 Add Container Linux Config snippet docs 2018-03-18 23:22:40 -07:00
f3730b2bfa Add Container Linux Config snippets feature
* Introduce the ability to support Container Linux Config
"snippets" for controllers and workers on cloud platforms.
This allows end-users to customize hosts by providing Container
Linux configs that are additively merged into the base configs
defined by Typhoon. Config snippets are validated, merged, and
show any errors during `terraform plan`
* Example uses include adding systemd units, network configs,
mounts, files, raid arrays, or other disk provisioning features
provided by Container Linux Configs (using Ignition low-level)
* Requires terraform-provider-ct v0.2.1 plugin
2018-03-18 18:28:18 -07:00
88aa9a46e5 Add /var/lib/calico volume mount to Calico DaemonSet 2018-03-18 16:40:38 -07:00
efa90d8b44 Add a new key=value label to controller nodes
* Add a node-role.kubernetes.io/controller="true" node label
to controllers so Prometheus service discovery can filter to
services that only run on controllers (i.e. masters)
* Leave node-role.kubernetes.io/master="" untouched as it's
a Kubernetes convention
2018-03-18 16:39:10 -07:00
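To illustrate why a key=value label helps with filtering, here is a minimal Python sketch. It does not reproduce Prometheus service discovery: the node dictionaries and the `controller_nodes` helper are hypothetical, and only the label names come from the commit above.

```python
# Label names from the commit message.
CONTROLLER_LABEL = "node-role.kubernetes.io/controller"
MASTER_LABEL = "node-role.kubernetes.io/master"


def controller_nodes(nodes):
    """Return the names of nodes labeled controller="true"."""
    return [
        node["name"]
        for node in nodes
        if node["labels"].get(CONTROLLER_LABEL) == "true"
    ]


if __name__ == "__main__":
    # Hypothetical node data for illustration only.
    nodes = [
        {"name": "ctrl-0", "labels": {CONTROLLER_LABEL: "true", MASTER_LABEL: ""}},
        {"name": "worker-0", "labels": {}},
    ]
    print(controller_nodes(nodes))
```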
46226a8015 Update Prometheus from 2.2.0 to 2.2.1 2018-03-18 15:56:44 -07:00
270d1ce357 Add links to upstream regressions 2018-03-14 18:56:20 -07:00
ab87b6cea3 Add clarifying links to CHANGES 2018-03-12 21:19:15 -07:00
d621512dd6 Promote AWS platform from beta to stable 2018-03-12 21:15:53 -07:00
c59a9c66b1 Revert "addons: Update from Grafana v4.6.3 to v5.0.0"
* Revert commit 9dcc255f8e.
* Grafana v5.0 is not compatible with Kubernetes v1.9.4. See
https://github.com/poseidon/typhoon/pull/162
2018-03-12 21:01:14 -07:00
21f2cef12f Improve changelog, README, and index page 2018-03-12 20:58:02 -07:00
931e311786 Update Kubernetes from v1.9.3 to v1.9.4 2018-03-12 18:07:50 -07:00
2592a0aad4 Allow Google accelerators (i.e. GPUs) on workers 2018-03-11 17:21:24 -07:00
6c5e287c29 Add details and links to the changelog 2018-03-11 17:07:07 -07:00
2a4595eeee Add links to the charitable donations list 2018-03-11 14:51:40 -07:00
8e7e6b9f7f Normalize Terraform configs with terraform fmt 2018-03-11 14:46:05 -07:00
35f3b1b28c Enable AWS NLB cross-zone load balancing
* https://github.com/terraform-providers/terraform-provider-aws/pull/3537
* https://aws.amazon.com/about-aws/whats-new/2018/02/network-load-balancer-now-supports-cross-zone-load-balancing/
2018-03-10 23:25:18 -08:00
9fb1e1a0e2 Update etcd from v3.3.1 to v3.3.2
* https://github.com/coreos/etcd/releases/tag/v3.3.2
2018-03-10 13:44:35 -08:00
b61d6373c5 Add ignore_changes for AWS worker image_id 2018-03-10 13:16:05 -08:00
42708f9a70 Update Prometheus from v2.2.0-rc.1 to v2.2.0
* https://github.com/prometheus/prometheus/releases/tag/v2.2.0
2018-03-09 00:20:40 -08:00
d54709f89c Update Grafana from v5.0.0 to 5.0.1
* https://github.com/grafana/grafana/releases/tag/v5.0.1
2018-03-09 00:20:40 -08:00
0e688ef05a Update CHANGES.md changelog with monitoring updates 2018-03-09 00:20:40 -08:00
9dcc255f8e addons: Update from Grafana v4.6.3 to v5.0.0 2018-03-09 00:20:40 -08:00
9307e97c46 addons: Update Prometheus from v2.1.0 to v2.2.0
* Annotate Prometheus service to scrape metrics from
Prometheus itself (enables Prometheus* alerts)
* Update kube-state-metrics addon-resizer to 1.7
* Use port 8080 for kube-state-metrics
* Add PrometheusNotIngestingSamples alert rule
* Change K8SKubeletDown alert rule to fire when 10%
of kubelets are down, not 1%
  * https://github.com/coreos/prometheus-operator/pull/1032
2018-03-09 00:20:40 -08:00
c112ee3829 Rename cluster_name to name in internal module
* Ensure consistency between AWS and GCP platforms
2018-03-03 17:52:01 -08:00
45b556c08f Fix overly strict firewall for GCP "worker pools"
* Fix issue where worker firewall rules didn't apply to
additional workers attached to a GCP cluster using the new
"worker pools" feature (unreleased, #148). Solves host
connection timeouts and pods not being scheduled to attached
worker pools.
* Add `name` field to GCP internal worker module to represent
the unique name of the worker pool
* Use `cluster_name` field of GCP internal worker module for
passing the name of the cluster to which workers should be
attached
2018-03-03 17:40:17 -08:00
da6aafe816 Revert "Add module version requirements to internal workers modules"
* This reverts commit cce4537487.
* Provider passing to child modules is complex and the behavior
changed between Terraform v0.10 and v0.11. We're continuing to
allow both versions so this change should be reverted. For the
time being, those using our internal Terraform modules will have
to be aware of the minimum version for AWS and GCP providers;
there is no good way to do enforcement.
2018-03-03 16:56:34 -08:00
cce4537487 Add module version requirements to internal workers modules 2018-03-03 14:39:25 -08:00
73126eb7f8 Add support for worker pools on AWS
* Allow groups of workers to be defined and joined to
a cluster (i.e. worker pools)
* Move worker resources into a Terraform submodule
* Output variables needed for passing to worker pools
* Add usage docs for AWS worker pools (advanced)
2018-02-27 18:31:42 -08:00
160ae34e71 Add support for worker pools on google-cloud
* Set defaults for internal worker module's count,
machine_type, and os_image
* Allow "pools" of homogeneous workers to be created
using the google-cloud/kubernetes/workers module
2018-02-26 22:36:36 -08:00
06d40c5b44 Show os_image coreos-stable on Google Cloud
* Don't need to define a specific dated image. Managed
instance groups do not delete instances when new images
are released to a channel
2018-02-26 22:24:44 -08:00
98985e5acd Remove unused etcd_service_ip template variable
* etcd_service_ip dates back to deprecated self-hosted etcd
2018-02-26 22:20:20 -08:00
ea6bf9c9fb Improve links in tutorials and changelog notes 2018-02-26 12:55:32 -08:00
486fdb6968 Simplify CLC kubeconfig templating on AWS and GCP
* Template terraform-render-bootkube's multi-line kubeconfig
output using the right indentation
* Add `kubeconfig` variable to google-cloud controllers and
workers Terraform submodules
* Remove `kubeconfig_*` variables from google-cloud controllers
and workers Terraform submodules
2018-02-26 12:49:01 -08:00
a44cf0edbd Update Calico from v3.0.2 to v3.0.3
* https://github.com/projectcalico/calico/releases/tag/v3.0.3
2018-02-26 12:48:19 -08:00
983c7aa012 Recommend installing terraform-provider-ct v0.2.1
* Upcoming releases may begin to use features that require
the `terraform-provider-ct` plugin v0.2.1
* New users should use `terraform-provider-ct` v0.2.1. Existing
users can safely drop-in replace their v0.2.0 plugin with v0.2.1
as well (location referenced in ~/.terraformrc).
* See https://github.com/poseidon/typhoon/pull/145
2018-02-25 19:39:54 -08:00
3d9683b6e8 Update the Digital Ocean SSH fingerprint docs 2018-02-25 19:09:38 -08:00
0da7757ef4 Pass Digital Ocean ssh_fingerprints as a list
* Fix digital-ocean module to pass ssh_fingerprints
as a list since the module accepts a list
2018-02-25 19:03:33 -08:00
04c6613ff3 Mention the command that applies the changes 2018-02-25 17:15:42 -08:00
92600efd11 Remove author employment disclosure note
* Author no longer works for CoreOS / Red Hat
* Typhoon development continues as usual
2018-02-24 18:30:51 -08:00
66c64b4e45 List addons below platforms in CHANGES 2018-02-22 22:33:13 -08:00
13f3745093 Add kubelet --volume-plugin-dir flag
* Set Kubelet search path for flexvolume plugins
to /var/lib/kubelet/volumeplugins
* Add support for flexvolume plugins on AWS, GCE, and DO
* See 9548572d98 which added flexvolume support for bare-metal
2018-02-22 22:11:45 -08:00
c4914c326b Update bootkube and terraform-render-bootkube to v0.11.0 2018-02-22 21:53:26 -08:00
461fd46986 Update CHANGES.md with AWS ELB to NLB change 2018-02-22 21:36:35 -08:00
ceb5555222 Switch apiserver from ELB to a network load balancer 2018-02-22 16:10:31 -08:00
86420fd507 Rename namespace manifests to be applied first
* Ensure kubectl apply -R creates manifests in the right order
2018-02-22 01:04:30 -08:00
5c383f4184 addons: Update nginx-ingress from 0.10.2 to 0.11.0 2018-02-21 23:54:12 -08:00
22fa051002 Switch Ingress ELB to a network load balancer
* Require terraform-provider-aws 1.7 or higher
2018-02-20 17:34:38 -08:00
c8313751d7 Ignore lifecycle changes to the AWS controller ami 2018-02-15 19:48:39 -08:00
195d902ab6 Upgrade etcd from v3.2.15 to v3.3.1 2018-02-15 19:29:46 -08:00
c19a68b59b Update bootkube control-plane manifests
* Remove PersistentVolumeLabel admission controller flag
* Switch Deployments and DaemonSets to apps/v1
* Minor update to pod-checkpointer image version
2018-02-15 11:06:35 -08:00
de88fa5457 addons: Update Heapster from v1.5.0 to v1.5.1
* Switch to k8s.gcr.io vanity image name
* Add service account, Role, and ClusterRole for heapster
2018-02-15 10:57:47 -08:00
d9a0183f3f addons/nginx-ingress: Fix typo in GCP selector name 2018-02-14 03:07:36 -05:00
7e24c67608 Remove docs mention of the etcd-network-checkpointer
* etcd-network-checkpointer is no longer used, it's a holdover
from the self-hosted etcd era
2018-02-13 16:19:03 -08:00
a37aff7f35 Update CHANGELOG.md for v1.9.3 2018-02-11 10:59:16 -08:00
03d23bfde7 addons: Remove Kubernetes Dashboard manifests and docs
* Stop maintaining Kubernetes Dashboard manifests. Dashboard takes
an unusual approach to security and is often a security weak point.
* Recommendation: Use `kubectl` and avoid using the dashboard. If
you must use the dashboard, explore hardening and consider using an
authenticating proxy rather than the dashboard's auth features
2018-02-11 10:33:23 -08:00
2c10d24113 addons: Switch to apps/v1 workload APIs
* Deployments now belong to the apps/v1 API group
* DaemonSets now belong to the apps/v1 API group
* RBAC types now belong to the rbac.authorization.k8s.io/v1 API group
2018-02-10 23:56:31 -08:00
82a616c70b Fix terraform config formatting 2018-02-10 15:18:27 -08:00
2fa7dac247 List aws platform in the GitHub issue template 2018-02-10 15:16:42 -08:00
a41691b222 Update Kubernetes from v1.9.2 to v1.9.3
* Add flannel service account and limited RBAC cluster role
* Change DaemonSets to tolerate NoSchedule and NoExecute taints
* Remove deprecated apiserver --etcd-quorum-read flag
* Update Calico from v3.0.1 to v3.0.2
* Add Calico GlobalNetworkSet CRD
* https://github.com/poseidon/terraform-render-bootkube/pull/44
2018-02-10 13:37:07 -08:00
9034203d7a Fix typo in list of maps comment 2018-02-09 19:11:06 -08:00
d42f6d6b5d Update author's employment disclosure
* Typhoon remains independently maintained. Its goals
remain unchanged
2018-01-30 15:00:07 -08:00
2fa1840c30 Update flannel from v0.9.0 to v0.10.0
* https://github.com/coreos/flannel/releases/tag/v0.10.0
2018-01-28 23:09:21 -08:00
8e0b8d7e40 Upgrade Calico from 2.6.6 to 3.0.1 2018-01-28 11:47:23 -08:00
a0cf527ccf Update changelog with recent addon improvements 2018-01-28 01:24:27 -08:00
65321acad2 addons: Add grafana-watcher and bundle dashboards
* Add separate Grafana addons docs and screenshots
2018-01-28 01:01:30 -08:00
064ce83f25 addons: Update Prometheus to v2.1.0
* Change service discovery to relabel jobs to align with
rule expressions in upstream examples
* Use a separate service account for prometheus instead
of granting roles to the namespace's default
* Use a separate service account for node-exporter
* Update node-exporter and kube-state-metrics exporters
2018-01-27 21:00:15 -08:00
c3b0cdddf3 addons: Update nginx-ingress from v0.10.1 to v0.10.2 2018-01-26 17:27:36 -08:00
211ec94c75 addons: Update CLUO from v0.5.0 to v0.6.0
* https://github.com/coreos/container-linux-update-operator/releases/tag/v0.6.0
2018-01-26 17:24:09 -08:00
8aca5a089e addons: Update nginx-ingress to 0.10.1 2018-01-24 20:34:05 -08:00
3e6e4ea339 Update etcd from 3.2.14 to 3.2.15
* https://github.com/coreos/etcd/releases/tag/v3.2.15
2018-01-23 23:50:04 -08:00
103f1e16d7 addons: Update nginx-ingress to 0.10.0 2018-01-23 09:28:37 -08:00
50dd3e3b82 Update Digital Ocean variables / docs to use new droplet sizes 2018-01-20 20:41:13 -05:00
3dc755994b Add missing changelog entry for Digital Ocean fix 2018-01-20 07:52:40 -08:00
ddbfb2eee1 Set module version tutorials docs for good practice 2018-01-19 23:16:48 -08:00
868265988b Update bootkube and terraform-render-bootkube to v0.10.0 2018-01-19 23:10:45 -08:00
6adffcb778 Update Kubernetes from v1.9.1 to v1.9.2 2018-01-19 08:40:09 -08:00
bc967ddcd0 addons: Update CLUO to fix compatibility with Kubernetes 1.9
* Update CLUO from v0.4.1 to v0.5.0
* Earlier versions of CLUO fail to drain nodes on Kubernetes 1.9
so nodes drain one at a time repeatedly and Container Linux OS
updates are not applied to nodes.
* Check current OS versions via `kubectl get nodes --show-labels`
2018-01-19 08:33:26 -08:00
ef18f19ec4 Edit digital ocean port range and ordering to suppress diff
* Change port range from keyword "all" to "1-65535", which is the
same but with digitalocean provider 0.1.3 doesn't produce a diff
* Rearrange egress firewall rules to match the order the Digital Ocean API
and provider return. In current testing, this fixes the last diff
that was present on `terraform plan`.
2018-01-15 22:13:59 -08:00
f5efcc1ff8 Relax digitalocean provider version constraints
* Relax fixed 0.1.2 version constraint to "~> 0.1.2", which
allows 0.1.3, 0.1.4, etc, but would not allow 0.2.0
2018-01-15 21:04:53 -08:00
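The behavior of the "~> 0.1.2" constraint described above can be checked with a small Python sketch. This is an independent re-implementation for illustration, not Terraform's own constraint parser: the helper names are hypothetical, and it assumes plain numeric versions with at least two components.

```python
def parse(version):
    """Turn "0.1.2" into the tuple (0, 1, 2)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version, constraint):
    """Check a version against a "~> X.Y.Z" style constraint.

    The last component may increase freely; the component before it is fixed.
    For "~> 0.1.2" that means >= 0.1.2 and < 0.2.0.
    """
    base = parse(constraint.replace("~>", "").strip())
    candidate = parse(version)
    # Upper bound: bump the second-to-last component, drop the last one.
    upper = base[:-2] + (base[-2] + 1,)
    return base <= candidate and candidate[: len(upper)] < upper


if __name__ == "__main__":
    for v in ("0.1.2", "0.1.3", "0.1.4", "0.2.0"):
        print(v, satisfies_pessimistic(v, "~> 0.1.2"))
```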
996651c605 Update kube-state-metrics version and RBAC cluster role
* https://github.com/kubernetes/kube-state-metrics/pull/345
* https://github.com/kubernetes/kube-state-metrics/pull/334
2018-01-15 08:33:44 -08:00
38fa7dff1a Create separate bare-metal container-linux-install profiles
* Create separate container-linux-install profiles (and
cached-container-linux-install) for each node in a cluster
* Fix contention bug on bare-metal during `terraform apply`.
With only a global install profile, terraform would create
(or retain) the profile for each cluster and try to delete
it for each cluster being deleted. As a result, in some cases
apply had to be run multiple times before terraform's repr
of constraints was satisfied (profile deleted and recreated)
* Allow Container Linux install properties to vary between
clusters, such as using a different Container Linux channel
or version for different clusters
2018-01-15 08:23:03 -08:00
bbe295a3f1 Add Terraform v0.11.x support and migration docs
* Add explicit "providers" section to modules for Terraform v0.11.x
* Retain support for Terraform v0.10.4+
* Add migration guide from Terraform v0.10.x to v0.11.x for those managing
existing clusters (action required!)
2018-01-13 15:30:08 -08:00
d8db296932 Update kube-dns and use separate service account
* Update kube-dns from v1.14.7 to v1.14.8
* Use a separate kube-dns service account
* https://github.com/kubernetes/kubernetes/pull/57918
2018-01-12 10:29:30 -08:00
388ac08492 Update etcd from 3.2.13 to 3.2.14
* https://github.com/coreos/etcd/releases/tag/v3.2.14
2018-01-12 07:20:55 -08:00
527b5ca602 Update CHANGELOG.md for v1.9.1 2018-01-09 07:03:04 -08:00
ecd6a9443b Add maintenance docs with upgrade policies
* Add best practices for maintenance
* Describe blue-green replacement strategy
* Mention unsupported in-place edit and
node replacement strategies
2018-01-09 06:54:44 -08:00
2523d64f95 Fix docs to show exporting KUBECONFIG 2018-01-06 16:55:06 -08:00
fc455c8624 Remove old mention of ACIs in bootkube.service description 2018-01-06 16:20:34 -08:00
7a0a60708e Bump Container Linux version shown in docs
* Be sure docs and examples list Container Linux versions that
have been patched for Meltdown just in case someone copy-pastes
or sees them as recent versions
2018-01-06 14:58:38 -08:00
51a5f64024 Enable portmap plugin alongside Calico to fix hostPort
* https://github.com/poseidon/terraform-render-bootkube/pull/36
2018-01-06 14:01:18 -08:00
e1f2125f02 Update etcd from 3.2.0 to 3.2.13
* https://github.com/coreos/etcd/releases/tag/v3.2.13
2018-01-06 14:01:18 -08:00
9329b775f6 Update Kubernetes from v1.8.6 to v1.9.1 2018-01-06 14:01:16 -08:00
e04cce1201 Update mkdocs and material docs theme 2018-01-06 10:59:56 -08:00
201a38bd90 Update CHANGELOG.md for v1.8.6 2017-12-22 13:00:18 -08:00
fbdd946601 Update Kubernetes from v1.8.5 to v1.8.6 2017-12-21 11:20:37 -08:00
19102636a9 Add link to dashboard 315 2017-12-15 18:52:40 -08:00
21e540159b addons: Update grafana from v4.6.2 to v4.6.3
* https://github.com/grafana/grafana/releases/tag/v4.6.3
2017-12-15 16:09:14 -08:00
43e65a4d13 Update CHANGELOG.md for v1.8.5 2017-12-15 02:04:13 -08:00
e79088baa0 Add optional cluster_domain_suffix variable
* Allow kube-dns to respond to DNS queries with a custom
suffix, instead of the default 'cluster.local'
* Useful when multiple clusters exist on the same local
network and wish to query services on one another
2017-12-15 01:45:52 -08:00
495e33e213 Update bootkube and terraform-render-bootkube to v0.9.1 2017-12-15 01:45:02 -08:00
63f5a26a72 Eliminate steps to move self-hosted etcd assets
* bootkube/assets/experimental/* assets corresponded to self-hosted
etcd manifests, which are no longer an option in Typhoon
2017-12-13 01:06:56 -08:00
eea79e895d Fix manifest consolidation in bootkube start wrapper
* Fix manifest existence test in /opt/bootkube/bootkube-start
to also work with more than one directory
2017-12-12 23:08:22 -08:00
99c07661c6 Fix old Container Linux versions mentioned in docs 2017-12-11 23:36:16 -08:00
521a1f0fee addons: Update heapster from v1.4.3 to v1.5.0
* Rollback addon-resizer to 1.7 to address issues in large
clusters https://github.com/kubernetes/kubernetes/pull/52536
2017-12-11 23:34:25 -08:00
7345cb6419 addons: Update nginx-ingress to 0.9.0 2017-12-11 00:48:15 -08:00
a481d71d7d addons: Update nginx-ingress to 0.9.0-beta.19
* Undo rollback f00ecde854
* Port binding regression only occurs with --enable-ssl-passthrough,
which isn't used in these examples. See
https://github.com/kubernetes/ingress-nginx/issues/1788
2017-12-11 00:44:32 -08:00
831a5c976c Add Kubernetes Dashboard warning and improve changelog 2017-12-09 22:38:27 -08:00
85e6783503 Recommend Container Linux images with Docker 17.09
* Container Linux stable and beta now provide Docker 17.09 (instead
of 1.12). Recommend images which provide 17.09.
* Older clusters (with CLUO addon) auto-update node's Container Linux version
and will begin using Docker 17.09.
2017-12-09 22:14:13 -08:00
165396d6aa Update Kubernetes from v1.8.4 to v1.8.5 2017-12-09 21:28:31 -08:00
ce49a93d5d Fix issue with etcd-member failing to resolve peers
* When restarting masters, `etcd-member.service` may fail to look up peers if
/etc/resolv.conf hasn't been populated yet. Require the wait-for-dns.service.
2017-12-09 20:12:49 -08:00
e623439eec Fix typos in docs and CONTRIBUTING.md 2017-12-09 19:58:09 -08:00
9548572d98 Add kubelet --volume-plugin-dir flag on bare-metal
* Kubelet will search path for flexvolume plugins
2017-12-05 13:12:53 -08:00
f00ecde854 Rollback nginx-ingress on GCE to 0.9.0-beta.17
* https://github.com/kubernetes/ingress-nginx/issues/1788
2017-12-02 14:06:22 -08:00
d85300f947 Clarify only Terraform v0.10.x should be used
* It is not safe to update to Terraform v0.11.x yet
* https://github.com/hashicorp/terraform/issues/16824
2017-12-02 01:31:39 -08:00
65f006e6cc addons: Sync prometheus alerts to upstream
* https://github.com/coreos/prometheus-operator/pull/774
2017-12-01 23:24:08 -08:00
8d3817e0ae addons: Update nginx-ingress to 0.9.0-beta.19
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.19
2017-12-01 22:32:33 -08:00
5f5eec1175 Update bootkube and terraform-render-bootkube to v0.9.0 2017-12-01 22:27:48 -08:00
5308fde3d3 Add Kubernetes certification badge 2017-11-29 19:26:49 -08:00
9ab61d7bf5 Add Typhoon images with and without text
* Serve images from GCS poseidon, rather than dghubble
2017-11-29 01:01:01 -08:00
6483f613c5 Update Kubernetes from v1.8.3 to v1.8.4 2017-11-28 21:52:11 -08:00
56c6bf431a Update terraform-render-bootkube for Kubernetes v1.8.4
* Update hyperkube from v1.8.3 to v1.8.4
* Remove flock from bootstrap-apiserver and kube-apiserver
* Remove unused critical-pod annotations in manifests
* Use service accounts for kube-proxy and pod-checkpointer
* Update Calico from v2.6.1 to v2.6.3
* Update flannel from v0.9.0 to v0.9.1
* Remove Calico termination grace period to prevent calico
from getting stuck for extended periods
* https://github.com/poseidon/terraform-render-bootkube/pull/29
2017-11-28 21:42:26 -08:00
63ab117205 addons: Add prometheus rules for DaemonSets
* https://github.com/coreos/prometheus-operator/pull/755
2017-11-16 23:51:21 -08:00
1cd262e712 addons: Fix prometheus K8SApiServerLatency alert rule
* https://github.com/coreos/prometheus-operator/issues/751
2017-11-16 23:37:15 -08:00
32bdda1b6c addons: Update Grafana from v4.6.1 to v4.6.2
* https://github.com/grafana/grafana/releases/tag/v4.6.2
2017-11-16 23:34:36 -08:00
07d257aa7b Add initrd kernel argument needed by UEFI clients
* https://github.com/coreos/bugs/issues/1239
2017-11-16 23:19:51 -08:00
fd96067125 Fix docs link for security issue reporting 2017-11-10 21:38:41 -08:00
9d16f5c78a Update min Google plugin and remove target pool workaround
* With google provider 1.2, target pool instances can use self_link
and zone/name formats without causing a diff on each plan
* Original workaround: 77fc14db71
2017-11-10 21:15:19 -08:00
159443bae7 addons: Add better alerting rules to Prometheus manifests
* Adapt the coreos/prometheus-operator alerting rules for Typhoon,
https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus/manifests
* Add controller manager and scheduler shim services to let
prometheus discover them via service endpoints
* Fix several alert rules to use service endpoint discovery
* A few rules still don't do much, but they default to green
2017-11-10 20:57:47 -08:00
119dc859d3 addons: Update nginx-ingress to 0.9.0-beta.17
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.17
2017-11-10 20:16:40 -08:00
5f6b0728c5 Update bootkube and terraform-render-bootkube to v0.8.2 2017-11-10 20:01:37 -08:00
d774c51297 Update Kubernetes from v1.8.2 to v1.8.3 2017-11-08 23:34:19 -08:00
f6a8fb363e Remove deprecated kubelet --require-kubeconfig flag
* https://github.com/kubernetes/kubernetes/pull/40050
2017-11-08 23:34:19 -08:00
f570af9418 addons: Update from Prometheus v1.8.2 to v2.0.0 2017-11-08 22:48:23 -08:00
4ec6732b98 Output the Google network name and self_link
* Allow users to add custom firewall rules for unique cases
2017-11-08 00:19:49 -08:00
ea1efb536a Remove old firewall rule for bootstrap self-hosted etcd 2017-11-08 00:15:20 -08:00
451fd86470 Improve internal firewall rules on Google Cloud
* Whitelist internal traffic between controllers and workers
* Switch to tag-based firewall policies rather than source IP
2017-11-08 00:15:06 -08:00
b1b611b22c Add docs to use one controller on Google Cloud 2017-11-07 19:51:03 -08:00
eabf00fbf1 Add missing controller dependency before bootkube start
* Require the controller module to be completed before starting
to remote exec bootkube start, otherwise it's possible the controller
nodes were created, but not the network load balancer
2017-11-07 19:12:05 -08:00
8eaa72c1ca addons: Update nginx-ingress to 0.9.0-beta.16
* Image registry changed from gcr.io to quay.io
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.16
2017-11-06 23:15:15 -08:00
58cf82da56 Promote AWS platform from alpha to beta 2017-11-06 21:38:24 -08:00
ccc832f468 Add firewall rule to allow apiserver to proxy other controller kubelets
* Prometheus proxies through the apiserver to scrape kubelets
* In multi-controller setups, an apiserver must be able to scrape
kubelets (10250) on other controllers
2017-11-06 01:03:53 -08:00
90f8d62204 Add firewall rules to allow prometheus to reach node-exporter
* node_exporter service endpoints run on hostNetwork port 9100
* Re-evaluate after https://github.com/kubernetes-incubator/bootkube/pull/711
2017-11-06 01:03:53 -08:00
af5c413abf Focus controller ELB on load balancing apiservers
* ELB distributing load across controllers is no longer the mechanism
used to SSH to instances to distribute secrets
* Focus the ELB on load balancing across apiserver and edit the HTTP
health check to an SSL:443 check
2017-11-06 01:03:53 -08:00
168c487484 Remove mention of self-hosted etcd, it's deprecated 2017-11-06 01:03:53 -08:00
805dd772a8 Run etcd cluster on-host, across controllers on AWS
* Change controllers ASG to heterogeneous EC2 instances
* Create DNS records for each controller's private IP for etcd
* Change etcd to run on-host, across controllers (etcd-member.service)
* Reduce time to bootstrap a cluster
* Deprecate self-hosted-etcd on the AWS platform
2017-11-06 01:03:53 -08:00
c6ec6596d8 Minor cleanup for zones, docs, and outputs
* Spread across all zones, regardless of UP/DOWN state
* Remove unused outputs of private IPs
2017-11-06 00:56:26 -08:00
47a9989927 Fix null_resource ordering constraints
* Ensure etcd TLS assets and kubeconfig are copied before
any attempt is made to run bootkube start
2017-11-06 00:55:44 -08:00
10b977d54a addons: Set kube-state-metrics to have clusterIP None
* kube-state-metrics service exists to facilitate prometheus discovery
2017-11-05 17:54:09 -08:00
b7a268fc45 addons: Add prometheus alertmanager flag
* Pass -alertmanager.url to work with a user's in-cluster
alertmanager deployment, if any
2017-11-05 15:50:46 -08:00
279f36effd addons: Add grafana 4.6.1 and extend prometheus docs 2017-11-05 15:23:56 -08:00
77fc14db71 Workaround target pool issue by listing instances as zone/name
* Instances can be listed by zone/name or self_link URL, but the
provider desires they be in zone/name form, which causes a diff
* https://github.com/terraform-providers/terraform-provider-google/issues/46
2017-11-05 14:07:05 -08:00
2b0296d671 Create controller instances across zones in the region
* Change controller instances to automatically span zones in a region
* Remove the `zone` required variable
2017-11-05 13:24:32 -08:00
7b38271212 Run etcd cluster on-host, across controllers on Google Cloud
* Change controllers from a managed group to individual instances
* Create discrete DNS records to each controller's private IP for etcd
* Change etcd to run on-host, across controllers (etcd-member.service)
* Reduce time to bootstrap a cluster
* Deprecate self-hosted-etcd on the Google Cloud platform
2017-11-05 11:03:35 -08:00
ae07a21e3d addons: Omit static resource requests/limits for kube-state-metrics
* Allow the addon-resizer to dynamically set resource values
* https://github.com/kubernetes/kube-state-metrics/pull/285
2017-11-04 14:41:04 -07:00
0ab1ae3210 addons: Fix typo in kube-state-metrics strategy 2017-11-04 14:39:56 -07:00
67e3d2b86e docs: GCE network bandwidth is excellent, even between zones
* Remove performance note that the GCE vs AWS network performance
is not an equal comparison. On both platforms, workers now span the
(availability) zones of a region.
* Testing host-to-host and pod-to-pod network bandwidth between nodes
(now located in different zones) showed no reduction in bandwidth
2017-11-04 14:08:20 -07:00
a48dd9ebd8 Require google provider version ~> 1.1
* Require google provider plugin 1.1 or higher which includes fix:
https://github.com/terraform-providers/terraform-provider-google/issues/574
* Remove workaround which statically set the persistent disk name
* Original reasons for workaround in a97df839 or GH #34
2017-11-04 12:59:19 -07:00
26a291aef4 Remove controller_preemptible option on Google Cloud
* Controller preemption is not safe or covered in documentation. Delete
the option, the variable is a holdover from old experiments
* Note, worker_preemptible is still a great feature that's supported
2017-11-04 12:59:19 -07:00
251a14519f Fix typo in internal template variable name
* ssh_authorized_keys should be ssh_authorized_key to match the user
facing variable which only allows a single SSH authorized key
2017-11-04 12:59:19 -07:00
6300383b43 Change worker managed instance group to span zones in region
* Change Google Cloud module to require the `region` variable
* Workers are created in random zones within the given region
* Tolerate Google Cloud zone failures or capacity issues
* If workers are preempted (if enabled), replacement instances can
be drawn from any zone in the region, which should avoid scheduling
issues that were possible before if a single zone aggressively
preempts instances (presumably due to Google Cloud capacity)
2017-11-04 12:59:19 -07:00
e32885c9cd addons: Update prometheus from v1.8.0 to v1.8.2
* https://github.com/prometheus/prometheus/releases/tag/v1.8.2
2017-11-04 11:00:39 -07:00
fe8afdbee9 Update Typhoon logo and favicon 2017-11-04 01:20:17 -07:00
878f5a3647 Bump bootkube and terraform-render-bootkube to v0.8.1
* Use the v0.8.1 tagged terraform-render-bootkube module
* Use the v0.8.1 quay.io/coreos/bootkube image to bootstrap
2017-10-28 12:50:37 -07:00
34ec7e9862 Relax pessimistic constraints on 1.0+ providers
* Constraint ~> 1.0 means users can use 1.0.1 or 1.1, but not 2.0
* https://www.terraform.io/docs/configuration/terraform.html
2017-10-25 23:27:28 -07:00
f6c6e85f84 Require minimum Terraform and plugin versions
* Bump minimum Terraform version to v0.10.4
* Allow minor version updates for 1.0+ plugins
* Fix versions for plugins which are pre-1.0
2017-10-25 23:00:31 -07:00
8582e19077 Expand Nginx Ingress liveness and readiness probes
* Remove dnsPolicy: ClusterFirst
* https://github.com/kubernetes/ingress-nginx/pull/1584
2017-10-25 22:29:20 -07:00
3727c40c6c Update Nginx Ingress defaultbackend from 1.0 to 1.4
* https://github.com/kubernetes/ingress-nginx/pull/1568
2017-10-25 22:16:23 -07:00
b608f9c615 addons: Use service endpoints to scrape node-exporter 2017-10-24 22:59:00 -07:00
ec1dbb853c addons: Include kube-state-metrics exporter manifests 2017-10-24 22:59:00 -07:00
d046d45769 addons: Include Prometheus and node-exporter manifests 2017-10-24 22:58:59 -07:00
a73f57fe4e Update CLUO from v0.4.0 to v0.4.1 2017-10-24 22:14:03 -07:00
60bc8957c9 Update Kubernetes from v1.8.1 to v1.8.2
* Kubernetes v1.8.2 fixes a memory leak in the v1.8.1 apiserver
* Switch to using the `gcr.io/google_containers/hyperkube` for the
on-host kubelet and shutdown drains
* Update terraform-render-bootkube manifests generation
  * Update flannel from v0.8.0 to v0.9.0
  * Add `hairpinMode` to flannel CNI config
  * Add `--no-negcache` to kube-dns dnsmasq
2017-10-24 21:44:26 -07:00
8b78c65483 Update Google Cloud Kubernetes from v1.7.7 to v1.8.1 2017-10-20 16:09:11 -07:00
f86c00288f Add missing update-agent RBAC role to get pods
* Drain now gets pods, deletes pods, and waits for deletion
2017-10-20 01:21:46 -07:00
a57b3cf973 Update CLUO addon to v0.4.0 and RBAC ClusterRole 2017-10-20 00:40:17 -07:00
10c5487ad7 Add docs corrections for versions and log output 2017-10-20 00:39:17 -07:00
e4c479554c Update AWS, DO, BM Kubernetes from v1.7.7 to v1.8.1
* Update from bootkube v0.7.0 to v0.8.0
* Leave Google Cloud update to a followup commit
2017-10-19 21:10:04 -07:00
be113e77b4 Fix links and add Calico BGP peering notes 2017-10-17 19:10:18 -07:00
911c53e4ae Add Ubiquity EdgeRouter documentation 2017-10-17 18:51:40 -07:00
bfa8dfc75d Conditionally set networkd content on bare-metal
* Without this change, if a cluster doesn't set the controller
or worker networkd lists, an err "element() may not be used
with an empty list" occurs.
* controller_networkds and worker_networkds are intended to be
optional and temporary, not required at all
2017-10-17 18:47:12 -07:00
43dc44623f Fix the terraform fmt of configs 2017-10-16 01:32:25 -07:00
734bc1d32a Add performance benchmark for flannel with bonded NICs 2017-10-16 01:12:13 -07:00
41e632280f Remove unused storage section ala PXE-only Matchbox templating 2017-10-16 00:42:20 -07:00
fc22f04dd6 Add temporary variables for multi-nic testing
* Accept ordered lists of controller and worker networkd configs
* Do not rely on these variables. They will be replaced with a
cleaner mechanism at a future date
2017-10-16 00:39:58 -07:00
377e14c80b Fix ingress addon docs recursive apply command 2017-10-16 00:29:04 -07:00
9ec8ec4afc Secure copy etcd TLS credentials to controllers only
* Controllers receive etcd TLS credentials
* Controllers and workers receive a kubeconfig
2017-10-14 20:48:02 -07:00
5c1ed37ff5 Add SSH key to user "debug" during disk-install phase
* Avoid adding SSH authorized key for user "core" during the disk
install, so that terraform apply cannot SSH until post-install
2017-10-14 20:37:42 -07:00
e765fb310d Allow setting custom PXE boot kernel_args on bare-metal 2017-10-14 19:39:10 -07:00
7b5ffd0085 Add Container Linux reboot-coordinator RBAC
* Add a reboot-coordinator namespace for CLUO components
* Define an RBAC ClusterRole for update-operator and update-agent
* Replace the older style, where CLUO ran in kube-system with
admin privilege
2017-10-14 19:35:06 -07:00
123439c2a4 Remove or compress docs image assets 2017-10-14 19:12:22 -07:00
11453bac91 Update heapster addon from v1.4.0 to v1.4.3
* Use normal name and phase labels
2017-10-14 19:07:37 -07:00
dd0c61d1d9 Update Nginx Ingress controller addon to 0.9.0-beta.15 2017-10-14 18:30:58 -07:00
5c87529011 Demote Google Cloud from stable to beta
* See #34 postmortem and action items for context on
when stable status will be restored
2017-10-11 19:32:04 -07:00
a97df839ea google-cloud: Set disk.device_name to match API default
* Terraform provider "google" plugin releases leave the disk
device_name as "" by default. Recently the API has started to
set a default name "persistent-disk-0". Plan and apply show
all instance groups need to be recreated to "fix" the name
* Impact: Controller and worker instance groups are deleted
and recreated, deleting data on controllers and bringing
down clusters
* Fix: Explicitly set the device_name to persistent-disk-0 so
that terraform finds no diff needs to be applied.
* https://github.com/poseidon/typhoon/issues/34
* https://github.com/terraform-providers/terraform-provider-google/issues/574
2017-10-11 18:04:39 -07:00
a5290dac32 Update docs to show Digital Ocean with on-host etcd 2017-10-09 23:47:32 -07:00
308c7dfb6e digital-ocean: Run etcd cluster on-host, across controllers
* Run etcd peers with TLS across controller nodes
* Deprecate self-hosted-etcd on the Digital Ocean platform
* Distribute etcd TLS certificates as part of initial provisioning
* Check the status of etcd by running `systemctl status etcd-member`
2017-10-09 22:43:23 -07:00
da63c89d71 Remove mention of ct plugin in bare-metal docs 2017-10-08 23:37:41 -07:00
62d7ccfff3 Add docs on provision time and network performance 2017-10-04 00:05:43 -07:00
1bc25c1036 Update Kubernetes from v1.7.5 to v1.7.7
* Update from bootkube v0.6.2 to v0.7.0
* Use renamed terraform-render-bootkube. Renamed from
bootkube-terraform to meet Terraform Module requirements
2017-10-03 21:03:15 -07:00
2d5a4ae1ef Update kube-dns image to address dnsmasq vulnerability
* https://security.googleblog.com/2017/10/behind-masq-yet-more-dns-and-dhcp.html
2017-10-02 10:27:10 -07:00
1ab27ae1f1 Fix status of the google-cloud module to production 2017-10-01 21:41:08 -07:00
def84aa5a0 docs: Add details about security features 2017-10-01 21:38:52 -07:00
dd883988bd Update from Calico v2.5.1 to v2.6.1
* Network policy improvements
* Update cni sidecar image from v1.10.0 to v1.11.0
* Lower log level in Calico CNI config from debug to info
2017-09-30 16:16:40 -07:00
e0d8917573 Add LICENSE to top-level of each module 2017-09-28 20:41:19 -07:00
f7f983c7da docs: Add docs and addons for Nginx AWS Ingress 2017-09-28 01:09:31 -07:00
b20233e05d aws: Add Ingress ELB DNS name output as ingress_dns_name
* Expose the Ingress ELB DNS name so application DNS records can
be defined in Terraform to resolve to the Ingress ELB
2017-09-28 00:46:17 -07:00
77e387cf83 Add top-level README.md with module overview 2017-09-27 22:09:52 -07:00
795428329a google-cloud: Move controller and worker submodules under kubernetes 2017-09-27 20:50:32 -07:00
f7dd959e9c bare-metal: Stop including etcd-network-checkpointer 2017-09-27 18:25:20 -07:00
b62a6def23 Merge pull request #26 from poseidon/fix-nfs-issue
Add Wants=rpc-statd.service to Kubelet
2017-09-24 20:18:22 -07:00
1b5caef4c1 Add Wants=rpc-statd.service to Kubelet
* Mounting NFS exports as volumes from some NFS servers fails because
the kubelet isn't starting rpc-statd as expected. Describing pods
that are stuck creating shows rpc.statd is required for remote locking
* Starting rpc-statd.service resolves the issue and all NFS mounts
seem to be working.
* Recommended approach https://github.com/coreos/bugs/issues/2074
2017-09-24 18:23:55 -07:00
767efabeb2 Merge pull request #23 from poseidon/drop-bm-self-etcd
bare-metal: Remove support for experimental_self_hosted_etcd
2017-09-23 16:55:25 -07:00
68726a2773 bare-metal: Remove support for experimental_self_hosted_etcd
* Transition from discouraging self-hosted etcd for bare-metal,
to removing it as an option
* See #13 and FAQ for self-hosted etcd discussion
2017-09-23 16:49:15 -07:00
4ea85b1ac8 Merge pull request #25 from poseidon/fix-bm-bootkube
bare-metal: Update to using Kubernetes v1.7.5 assets
2017-09-23 16:31:01 -07:00
74d8b9dabe *: Update bootkube-terraform sha hash to corresponding named tag
* bootkube-terraform v0.6.2 dbfb11c6eafa08f839eac2834ca1aca35dafe965
2017-09-23 14:10:42 -07:00
777c860b1c bare-metal: Update to using Kubernetes v1.7.5 control plane manifests
* bootkube-terraform module wasn't bumped for bare-metal
2017-09-23 14:04:18 -07:00
b033a94efc Merge pull request #24 from poseidon/improve-docs
README: Add IRC link, CHANGES.md, and minor fixes
2017-09-23 14:02:25 -07:00
235c8a5222 README: Add IRC link, CHANGES.md, and minor fixes 2017-09-23 13:55:44 -07:00
69cabd9486 Merge pull request #22 from poseidon/better-templating
bare-metal: Use Terraform templating for Container Linux configs
2017-09-23 12:55:55 -07:00
bca96bb124 bare-metal: Use Terraform templating for Container Linux configs
* Template bare-metal Container Linux configs with Terraform's
(limited) template_file module. This allows rendering problems
to be identified during `terraform plan` and is favored over
using the Matchbox templating feature when the configs are
served to PXE booting nodes.
* Writes a Matchbox profile for each machine, which will be served
as-is. The effect is the same, each node gets provisioned with its
own Container Linux config.
2017-09-23 11:49:12 -07:00
cd368c123f docs: Add missing Terraform plugin section for bare-metal 2017-09-18 22:36:01 -07:00
7c733bd314 Add Nginx Ingress controller addons and docs 2017-09-18 01:48:21 -07:00
229a4c5293 Merge pull request #18 from poseidon/add-aws
Add AWS module and docs
2017-09-17 23:50:49 -07:00
47387d552a docs: Add tutorial for AWS usage 2017-09-17 23:41:43 -07:00
7c046b6206 *: Fix Terraform fmt and comments 2017-09-17 21:43:00 -07:00
d8e4ac172a Add dghubble/pegasus AWS Kubernetes Terraform module 2017-09-17 21:40:33 -07:00
663f37ed6d google-cloud: Remove unused service accounts 2017-09-14 15:47:44 -07:00
fb5f63c8be google-cloud: Update kubelet.service unit to match upstream
* Mount host /opt/cni/bin in Kubelet to use host's CNI plugins
* Switch /var/run/kubelet-pod.uuid to /var/cache/kubelet-pod.uuid
to persist between reboots and cleanup old Kubelet pods
* Organize Kubelet flags in alphabetical order
2017-09-14 15:47:44 -07:00
0d6410505d bare-metal: Update kubelet.service unit to match upstream
* Mount host /opt/cni/bin in Kubelet to use host's CNI plugins
* Switch /var/run/kubelet-pod.uuid to /var/cache/kubelet-pod.uuid
to persist between reboots and cleanup old Kubelet pods
* Organize Kubelet flags in alphabetical order
2017-09-14 11:44:02 -07:00
2a2ed372c8 digital-ocean: Update kubelet.service unit to match upstream
* Mount host /opt/cni/bin in Kubelet to use host's CNI plugins
* Switch /var/run/kubelet-pod.uuid to /var/cache/kubelet-pod.uuid
to persist between reboots and cleanup old Kubelet pods
* Organize Kubelet flags in alphabetical order
2017-09-13 20:49:23 -07:00
2ff6d602d8 digital-ocean: Distribute kubeconfig via Terraform null_resource
* Keep kubeconfig out of DigitalOcean metadata user-data
2017-09-13 20:19:52 -07:00
64e8d207b1 Change bare-metal and GCE networking default to calico
* Switch networking default from flannel to calico
2017-09-12 09:16:58 -07:00
a441f5c6e0 Update Kubernetes from v1.7.3 to v1.7.5 2017-09-08 13:56:20 -07:00
00b61a26c0 docs: Add docs on Calico networking support
* Digital Ocean firewalls don't yet support the required
IP tunneling protocol so Calico cannot be used without
disabling firewalls right now.
2017-09-05 19:01:32 -07:00
1efe39d6bc Allow MTU for bare-metal Calico to be customized
* Calico on bare-metal defaults to IP-in-IP encapsulation and MTU 1480
2017-09-05 19:01:18 -07:00
ec46bc13ae Add support for Calico networking on GCE
* Calico on GCE with IP-in-IP encapsulation and MTU 1440
* Calico on DO with IP-in-IP encapsulation and MTU 1440
* Digital Ocean firewalls don't support IPIP protocol yet
2017-09-05 18:22:14 -07:00
d48f88cfd6 Fix typo in the issue template 2017-09-04 20:56:01 -07:00
6ef326a872 bare-metal: Add support for Calico networking
* Add variable networking with "flannel" or "calico"
2017-09-01 17:52:22 -07:00
64435adbc3 Merge pull request #7 from ericchiang/fix-link
README.md: fix addons link
2017-08-29 10:51:35 -07:00
140e869278 README.md: fix addons link 2017-08-29 10:49:01 -07:00
082dedbdbd docs: Fix broken addons overview.md link 2017-08-27 21:11:24 -07:00
a2609c14c0 addons: Disable Google Analytics in CLUO 2017-08-27 21:06:49 -07:00
564c0160bf Add heapster, dashboard, and CLUO addons 2017-08-27 17:20:29 -07:00
5b2275872c Update README to match docs index page 2017-08-27 16:09:23 -07:00
2faacc6a50 Add concepts, tutorials, and faq docs
* Add bare-metal tutorial
* Add DigitalOcean tutorial
* Add Google Cloud tutorial
2017-08-27 15:21:57 -07:00
056bd8a059 google-cloud: Remove deprecated automatic_restart field
* In terraform-provider-google v0.1.3, it is no longer necessary
to supply a (duplicated) value for the instance_template field
automatic_restart
* Previously this field was set to match the scheduling
automatic_restart since the field defaulted to true and would
cause plan to always show changes were needed
2017-08-25 00:14:02 -07:00
6a574d4a01 Organize README to work with published docs 2017-08-23 00:53:21 -07:00
b29a6cd1cd digital-ocean: Fix the digital-ocean default variables.tf
* Set the controller_type default to 2gb, the minimum that will
work
2017-08-23 00:53:03 -07:00
a97bbf7128 digital-ocean: Switch droplet tag string to tag reference
* Without a reference to a Digital Ocean tag object, terraform may
try to create a firewall rule before a tag actually exists. By
referencing the actual tag objects, the dependency order is
implied
2017-08-16 20:13:18 -07:00
dc3ff174ea Update Kubernetes from v1.7.1 to v1.7.3 2017-08-16 20:12:59 -07:00
fc018ffa28 Rename project and organization 2017-08-14 19:24:04 -07:00
bac968d3eb Simplify google-cloud cluster variables
* Remove k8s_domain_name input variable, the controller DNS
record will be "${var.cluster_name}.${dns_zone}"
* Rename dns_base_zone to dns_zone
* Rename dns_base_zone_name to dns_zone_name
2017-08-13 13:06:12 -07:00
40bd338eab Add Github issue and pull request templates 2017-08-13 12:30:30 -07:00
e5975cf9c7 Add CONTRIBUTING.md and DCO agreement 2017-08-13 12:27:17 -07:00
e19517d3df Fix the terraform fmt of configs 2017-08-12 18:26:05 -07:00
f04411377f digital-ocean: Add cluster firewall rules
* Requires Terraform v0.10.0+
2017-08-12 18:22:18 -07:00
238 changed files with 21068 additions and 1137 deletions

33
.github/ISSUE_TEMPLATE.md vendored Normal file

@@ -0,0 +1,33 @@
<!-- Fill in either the 'Bug' or 'Feature Request' section -->
## Bug
### Environment
* Platform: aws, bare-metal, google-cloud, digital-ocean
* OS: container-linux, fedora-atomic
* Terraform: `terraform version`
* Plugins: Provider plugin versions
* Ref: Git SHA (if applicable)
### Problem
Describe the problem.
### Desired Behavior
Describe the goal.
### Steps to Reproduce
Provide clear steps to reproduce the issue unless already covered.
## Feature Request
### Feature
Describe the feature and what problem it solves.
### Tradeoffs
What are the pros and cons of this feature? How will it be exercised and maintained?

10
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@@ -0,0 +1,10 @@
High level description of the change.
* Specific change
* Specific change
## Testing
Describe your work to validate the change works.
rel: issue number (if applicable)

390
CHANGES.md Normal file

@@ -0,0 +1,390 @@
# Typhoon
Notable changes between versions.
## Latest
* [Introduce](https://typhoon.psdn.io/announce/#april-26-2018) Typhoon for Fedora Atomic ([#199](https://github.com/poseidon/typhoon/pull/199))
* Kubernetes [v1.10.2](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#v1102)
* Update Calico from v3.0.4 to v3.1.1 ([#197](https://github.com/poseidon/typhoon/pull/197))
  * https://www.projectcalico.org/announcing-calico-v3-1/
  * https://github.com/projectcalico/calico/releases/tag/v3.1.0
* Update etcd from v3.3.3 to v3.3.4
* Update kube-dns from v1.14.9 to v1.14.10
#### Google Cloud
* Add support for multi-controller clusters (i.e. multi-master) ([#54](https://github.com/poseidon/typhoon/issues/54), [#190](https://github.com/poseidon/typhoon/pull/190))
* Switch from Google Cloud network load balancer to a TCP proxy load balancer. Avoid a [bug](https://issuetracker.google.com/issues/67366622) in Google network load balancers that limited clusters to only bootstrapping one controller node.
* Add TCP health check for apiserver pods on controllers. Replace kubelet check approximation.
#### Addons
* Update nginx-ingress from 0.12.0 to 0.14.0
* Update kube-state-metrics from v1.3.0 to v1.3.1
## v1.10.1
* Kubernetes [v1.10.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#v1101)
* Enable etcd v3.3 metrics endpoint ([#175](https://github.com/poseidon/typhoon/pull/175))
* Use `k8s.gcr.io` instead of `gcr.io/google_containers` ([#180](https://github.com/poseidon/typhoon/pull/180))
  * Kubernetes [recommends](https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ) using the alias to pull from the nearest regional mirror and to abstract the backing container registry
* Update etcd from v3.3.2 to v3.3.3
* Update kube-dns from v1.14.8 to v1.14.9
* Use kubernetes-incubator/bootkube v0.12.0
#### Bare-Metal
* Fix need for multiple `terraform apply` runs to create a cluster with Terraform v0.11.4 ([#181](https://github.com/poseidon/typhoon/pull/181))
  * To SSH during a disk install for debugging, SSH as user "core" with port 2222
  * Remove the old trick of using a user "debug" during disk install
#### Google Cloud
* Refactor out the `controller` internal module
#### Addons
* Add Prometheus discovery for etcd peers on controller nodes ([#175](https://github.com/poseidon/typhoon/pull/175))
  * Scrape etcd v3.3 `--listen-metrics-urls` for metrics
  * Enable etcd alerts and populate the etcd Grafana dashboard
* Update kube-state-metrics from v1.2.0 to v1.3.0
## v1.10.0
* Kubernetes [v1.10.0](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#v1100)
* Remove unused, unmaintained `pxe-worker` internal module
#### AWS
* Add `disk_type` optional variable for setting the EBS volume type ([#176](https://github.com/poseidon/typhoon/pull/176))
  * Change default type from `standard` to `gp2`. Prometheus etcd alerts are tuned for fast disks.
#### Digital Ocean
* Ensure etcd secrets are only distributed to controller hosts, not workers.
* Remove `networking` optional variable. Only flannel works on Digital Ocean.
#### Google Cloud
* Add `disk_size` optional variable for setting instance disk size in GB
* Add `controller_type` optional variable for setting machine type for controllers
* Add `worker_type` optional variable for setting machine type for workers
* Remove `machine_type` optional variable. Use `controller_type` and `worker_type`.
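
As a rough sketch of the Google Cloud changes above, a cluster that previously set `machine_type` would now set the controller and worker types separately. The required inputs are copied from the README example in this comparison; the machine types and disk size are illustrative assumptions rather than recommendations, and the `providers` map from the README example is omitted for brevity.

```tf
module "google-cloud-yavin" {
  source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=v1.10.2"

  # Required inputs, as in the README example
  cluster_name       = "yavin"
  region             = "us-central1"
  dns_zone           = "example.com"
  dns_zone_name      = "example-zone"
  ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
  asset_dir          = "/home/user/.secrets/clusters/yavin"

  # Optional inputs (values below are assumptions chosen for illustration)
  controller_type = "n1-standard-2" # replaces machine_type for controllers
  worker_type     = "n1-standard-1" # replaces machine_type for workers
  disk_size       = 40              # instance disk size in GB
}
```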
#### Addons
* Update Grafana from v4.6.3 to v5.0.4 ([#153](https://github.com/poseidon/typhoon/pull/153), [#174](https://github.com/poseidon/typhoon/pull/174))
* Restrict dashboard organization role to Viewer
## v1.9.6
* Kubernetes [v1.9.6](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v196)
* Update Calico from v3.0.3 to v3.0.4
#### Addons
* Update heapster from v1.5.1 to v1.5.2
## v1.9.5
* Kubernetes [v1.9.5](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v195)
* Fix `subPath` volume mounts regression ([kubernetes#61076](https://github.com/kubernetes/kubernetes/issues/61076))
* Introduce [Container Linux Config snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) on cloud platforms ([#145](https://github.com/poseidon/typhoon/pull/145))
* Validate and additively merge custom Container Linux Configs during `terraform plan`
* Define files, systemd units, dropins, networkd configs, mounts, users, and more
* Require updating `terraform-provider-ct` plugin from v0.2.0 to v0.2.1
* Add `node-role.kubernetes.io/controller="true"` node label to controllers ([#160](https://github.com/poseidon/typhoon/pull/160))
#### AWS
* [Require](https://typhoon.psdn.io/topics/maintenance/#terraform-provider-ct-v021) updating `terraform-provider-ct` plugin from v0.2.0 to [v0.2.1](https://github.com/coreos/terraform-provider-ct/releases/tag/v0.2.1) (action required!)
#### Digital Ocean
* [Require](https://typhoon.psdn.io/topics/maintenance/#terraform-provider-ct-v021) updating `terraform-provider-ct` plugin from v0.2.0 to [v0.2.1](https://github.com/coreos/terraform-provider-ct/releases/tag/v0.2.1) (action required!)
#### Google Cloud
* [Require](https://typhoon.psdn.io/topics/maintenance/#terraform-provider-ct-v021) updating `terraform-provider-ct` plugin from v0.2.0 to [v0.2.1](https://github.com/coreos/terraform-provider-ct/releases/tag/v0.2.1) (action required!)
* Relax `os_image` to optional. Default to "coreos-stable".
#### Addons
* Update nginx-ingress from 0.11.0 to 0.12.0
* Update Prometheus from 2.2.0 to 2.2.1
## v1.9.4
* Kubernetes [v1.9.4](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v194)
* Secret, configMap, downward API, and projected volumes now read-only (breaking, [kubernetes#58720](https://github.com/kubernetes/kubernetes/pull/58720))
* Regressed `subPath` volume mounts (regression, [kubernetes#61076](https://github.com/kubernetes/kubernetes/issues/61076))
* Mitigated `subPath` [CVE-2017-1002101](https://github.com/kubernetes/kubernetes/issues/60813)
* Introduce [worker pools](https://typhoon.psdn.io/advanced/worker-pools/) for AWS and Google Cloud for joining heterogeneous workers to existing clusters.
* Use new Network Load Balancers and cross zone load balancing on AWS
* Allow flexvolume plugins to be used on any Typhoon cluster (not just bare-metal)
* Upgrade etcd from v3.2.15 to v3.3.2
* Update Calico from v3.0.2 to v3.0.3
* Use kubernetes-incubator/bootkube v0.11.0
* [Recommend](https://typhoon.psdn.io/topics/maintenance/#terraform-provider-ct-v021) updating `terraform-provider-ct` plugin from v0.2.0 to [v0.2.1](https://github.com/coreos/terraform-provider-ct/releases/tag/v0.2.1) (action recommended)
#### AWS
* Promote AWS platform to stable
* Allow groups of workers to be defined and joined to a cluster (i.e. worker pools) ([#150](https://github.com/poseidon/typhoon/pull/150))
* Replace the apiserver elastic load balancer with a network load balancer ([#136](https://github.com/poseidon/typhoon/pull/136))
* Replace the Ingress elastic load balancer with a network load balancer ([#141](https://github.com/poseidon/typhoon/pull/141))
* AWS [NLBs](https://aws.amazon.com/blogs/aws/new-network-load-balancer-effortless-scaling-to-millions-of-requests-per-second/) can handle millions of RPS with high throughput and low latency.
* Require `terraform-provider-aws` 1.7.0 or higher
* Enable NLB [cross-zone](https://aws.amazon.com/about-aws/whats-new/2018/02/network-load-balancer-now-supports-cross-zone-load-balancing/) load balancing ([#159](https://github.com/poseidon/typhoon/pull/159))
* Requests are automatically evenly distributed to targets regardless of AZ
* Require `terraform-provider-aws` 1.11.0 or higher
* Add kubelet `--volume-plugin-dir` flag to allow flexvolume plugins ([#142](https://github.com/poseidon/typhoon/pull/142))
* Fix controller and worker launch configs to ignore AMI changes ([#126](https://github.com/poseidon/typhoon/pull/126), [#158](https://github.com/poseidon/typhoon/pull/158))
#### Digital Ocean
* Add kubelet `--volume-plugin-dir` flag to allow flexvolume plugins ([#142](https://github.com/poseidon/typhoon/pull/142))
* Fix to pass `ssh_fingerprints` as a list to droplets ([#143](https://github.com/poseidon/typhoon/pull/143))
#### Google Cloud
* Allow groups of workers to be defined and joined to a cluster (i.e. worker pools) ([#148](https://github.com/poseidon/typhoon/pull/148))
* Add kubelet `--volume-plugin-dir` flag to allow flexvolume plugins ([#142](https://github.com/poseidon/typhoon/pull/142))
* Add `kubeconfig` variable to `controllers` and `workers` submodules ([#147](https://github.com/poseidon/typhoon/pull/147))
* Remove `kubeconfig_*` variables from `controllers` and `workers` submodules ([#147](https://github.com/poseidon/typhoon/pull/147))
* Allow initial experimentation with accelerators (i.e. GPUs) on workers ([#161](https://github.com/poseidon/typhoon/pull/161)) (unofficial)
* Require `terraform-provider-google` v1.6.0
#### Addons
* Update Prometheus from 2.1.0 to 2.2.0 ([#153](https://github.com/poseidon/typhoon/pull/153))
* Scrape Prometheus itself to enable alerts about Prometheus
* Adjust KubeletDown rule to fire when 10% of kubelets are down
* Update heapster from v1.5.0 to v1.5.1 ([#131](https://github.com/poseidon/typhoon/pull/131))
* Use separate service account
* Update nginx-ingress from 0.10.2 to 0.11.0
## v1.9.3
* Kubernetes [v1.9.3](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v193)
* Network improvements and fixes ([#104](https://github.com/poseidon/typhoon/pull/104))
* Switch from Calico v2.6.6 to v3.0.2
* Add Calico GlobalNetworkSet CRD
* Update flannel from v0.9.0 to v0.10.0
* Use separate service account for flannel
* Update etcd from v3.2.14 to v3.2.15
#### Digital Ocean
* Use new Droplet [types](https://developers.digitalocean.com/documentation/changelog/api-v2/new-size-slugs-for-droplet-plan-changes/) which offer more CPU/memory, at lower cost. ([#105](https://github.com/poseidon/typhoon/pull/105))
* A small Digital Ocean cluster costs less than $25 a month!
#### Addons
* Update Prometheus from v2.0.0 to v2.1.0 ([#113](https://github.com/poseidon/typhoon/pull/113))
* Improve alerting rules
* Relabel discovered kubelet, endpoint, service, and apiserver scrapes
* Use separate service accounts
* Update node-exporter and kube-state-metrics
* Include Grafana dashboards for Kubernetes admins ([#113](https://github.com/poseidon/typhoon/pull/113))
* Add grafana-watcher to load bundled upstream dashboards
* Update nginx-ingress from 0.9.0 to 0.10.2
* Update CLUO from v0.5.0 to v0.6.0
* Switch manifests to use `apps/v1` Deployments and Daemonsets ([#120](https://github.com/poseidon/typhoon/pull/120))
* Remove Kubernetes Dashboard manifests ([#121](https://github.com/poseidon/typhoon/pull/121))
## v1.9.2
* Kubernetes [v1.9.2](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v192)
* Add Terraform v0.11.x support
* Add explicit "providers" section to modules for Terraform v0.11.x
* Retain support for Terraform v0.10.4+
* Add [migration guide](https://typhoon.psdn.io/topics/maintenance/#terraform-v011x) from Terraform v0.10.x to v0.11.x (**action required!**)
* Update etcd from 3.2.13 to 3.2.14
* Update calico from 2.6.5 to 2.6.6
* Update kube-dns from v1.14.7 to v1.14.8
* Use separate service account for kube-dns
* Use kubernetes-incubator/bootkube v0.10.0
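
The explicit "providers" section for Terraform v0.11.x can be sketched roughly as follows. This assumes each provider is declared with the alias `default`, which is what the `google.default`, `local.default`, and similar references in the README example point to. Credentials, project settings, and version pins are omitted, so treat this as an outline rather than the migration guide's exact contents.

```tf
# Hypothetical providers.tf for Terraform v0.11.x.
# Each alias matches a "<provider>.default" reference passed to the module.
provider "google" {
  alias = "default"
  # credentials, project, and region omitted
}

provider "local" {
  alias = "default"
}

provider "null" {
  alias = "default"
}

provider "template" {
  alias = "default"
}

provider "tls" {
  alias = "default"
}
```

The cluster module then receives these through its `providers` map, for example `google = "google.default"`.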
#### Bare-Metal
* Use per-node Container Linux install profiles ([#97](https://github.com/poseidon/typhoon/pull/97))
* Allow Container Linux channel/version to be chosen per-cluster
* Fix issue where cluster deletion could require `terraform apply` multiple times
#### Digital Ocean
* Relax `digitalocean` provider version constraint
* Fix bug with `terraform plan` always showing a firewall diff to be applied ([#3](https://github.com/poseidon/typhoon/issues/3))
#### Addons
* Update CLUO to v0.5.0 to fix compatibility with Kubernetes 1.9 (**important**)
* Earlier versions can't roll out Container Linux updates on Kubernetes 1.9 nodes ([cluo#163](https://github.com/coreos/container-linux-update-operator/issues/163))
* Update kube-state-metrics from v1.1.0 to v1.2.0
* Fix RBAC cluster role for kube-state-metrics
## v1.9.1
* Kubernetes [v1.9.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md#v191)
* Update kube-dns from 1.14.5 to v1.14.7
* Update etcd from 3.2.0 to 3.2.13
* Update Calico from v2.6.4 to v2.6.5
* Enable portmap to fix hostPort with Calico
* Use separate service account for controller-manager
## v1.8.6
* Kubernetes [v1.8.6](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#v186)
* Update Calico from v2.6.3 to v2.6.4
## v1.8.5
* Kubernetes [v1.8.5](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#v185)
* Recommend Container Linux [images](https://coreos.com/releases/) with Docker 17.09
* Container Linux stable, beta, and alpha now provide Docker 17.09 (instead of 1.12)
* Older clusters (with CLUO addon) auto-update Container Linux version to begin using Docker 17.09
* Fix race where `etcd-member.service` could fail to resolve peers ([#69](https://github.com/poseidon/typhoon/pull/69))
* Add optional `cluster_domain_suffix` variable (#74)
* Use kubernetes-incubator/bootkube v0.9.1
#### Bare-Metal
* Add kubelet `--volume-plugin-dir` flag to allow flexvolume providers ([#61](https://github.com/poseidon/typhoon/pull/61))
#### Addons
* Discourage deploying the Kubernetes Dashboard (security)
## v1.8.4
* Kubernetes v1.8.4
* Calico-related bug fixes
* Update Calico from v2.6.1 to v2.6.3
* Update flannel from v0.9.0 to v0.9.1
* Add service accounts for kube-proxy and pod-checkpointer
* Use kubernetes-incubator/bootkube v0.9.0
## v1.8.3
* Kubernetes v1.8.3
* Run etcd on-host, across controllers
* Promote AWS platform to beta
* Use kubernetes-incubator/bootkube v0.8.2
#### Google Cloud
* Add required variable `region` (e.g. "us-central1")
* Reduce time to bootstrap a cluster
* Change etcd to run on-host, across controllers (etcd-member.service)
* Change controller instances to automatically span zones in the region
* Change worker managed instance group to automatically span zones in the region
* Improve internal firewall rules and use tag-based firewall policies
* Remove support for self-hosted etcd
* Remove the `zone` required variable
* Remove the `controller_preemptible` optional variable
#### AWS
* Promote AWS platform to beta
* Reduce time to bootstrap a cluster
* Change etcd to run on-host, across controllers (etcd-member.service)
* Fix firewall rules for multi-controller kubelet scraping and node-exporter
* Remove support for self-hosted etcd
#### Addons
* Add Prometheus 2.0 addon with alerting rules
* Add Grafana dashboard for observing metrics
## v1.8.2
* Kubernetes v1.8.2
* Fixes a memory leak in the v1.8.1 apiserver ([kubernetes#53485](https://github.com/kubernetes/kubernetes/issues/53485))
* Switch to using the `gcr.io/google_containers/hyperkube` image
* Update flannel from v0.8.0 to v0.9.0
* Add `hairpinMode` to flannel CNI config
* Add `--no-negcache` to kube-dns dnsmasq
* Use kubernetes-incubator/bootkube v0.8.1
## v1.8.1
* Kubernetes v1.8.1
* Use kubernetes-incubator/bootkube v0.8.0
#### Digital Ocean
* Run etcd cluster across controller nodes (etcd-member.service)
* Remove support for self-hosted etcd
* Reduce time to bootstrap a cluster
## v1.7.7
* Kubernetes v1.7.7
* Use kubernetes-incubator/bootkube v0.7.0
* Update kube-dns to 1.14.5 to fix dnsmasq [vulnerability](https://security.googleblog.com/2017/10/behind-masq-yet-more-dns-and-dhcp.html)
* Calico v2.6.1
* flannel-cni v0.3.0
* Update flannel CNI config to fix hostPort
## v1.7.5
* Kubernetes v1.7.5
* Use kubernetes-incubator/bootkube v0.6.2
* Add AWS Terraform module (alpha)
* Add support for Calico networking (bare-metal, Google Cloud, AWS)
* Change networking default from "flannel" to "calico"
#### AWS
* Add `network_mtu` to allow CNI interface MTU customization
#### Bare-Metal
* Add `network_mtu` to allow CNI interface MTU customization
* Remove support for `experimental_self_hosted_etcd`
## v1.7.3
* Kubernetes v1.7.3
* Use kubernetes-incubator/bootkube v0.6.1
#### Digital Ocean
* Add cloud firewall rules (requires Terraform v0.10)
* Change node tags from strings to Digital Ocean tags
## v1.7.1
* Kubernetes v1.7.1
* Use kubernetes-incubator/bootkube v0.6.0
* Add Bare-Metal Terraform module (stable)
* Add Digital Ocean Terraform module (beta)
#### Google Cloud
* Remove `k8s_domain_name` variable, `cluster_name` + `dns_zone` resolves to controllers
* Rename `dns_base_zone` to `dns_zone`
* Rename `dns_base_zone_name` to `dns_zone_name`
## v1.6.7
* Kubernetes v1.6.7
* Use kubernetes-incubator/bootkube v0.5.1
## v1.6.6
* Kubernetes v1.6.6
* Use kubernetes-incubator/bootkube v0.4.5
* Disable locksmithd on hosts, in favor of [CLUO](https://github.com/coreos/container-linux-update-operator).
## v1.6.4
* Kubernetes v1.6.4
* Add Google Cloud Terraform module (stable)
## Earlier
Earlier versions, back to v1.3.0, used different designs and mechanisms.

5
CONTRIBUTING.md Normal file

@ -0,0 +1,5 @@
# Contributing
## Developer Certificate of Origin
By contributing, you agree to the Linux Foundation's Developer Certificate of Origin ([DCO](DCO)). The DCO is a statement that you, the contributor, have the legal right to make your contribution and understand the contribution will be distributed as part of this project.

37
DCO Normal file

@ -0,0 +1,37 @@
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.


@ -1,5 +1,6 @@
The MIT License (MIT)
Copyright (c) 2017 Typhoon Authors
Copyright (c) 2017 Dalton Hubble
Permission is hereby granted, free of charge, to any person obtaining a copy

151
README.md

@ -1,55 +1,140 @@
# purenetes <img align="right" src="https://storage.googleapis.com/dghubble/spin.png">
# Typhoon [![IRC](https://img.shields.io/badge/freenode-%23typhoon-0099ef.svg)]() <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
* Minimal, stable base Kubernetes distribution
* Declarative infrastructure and configuration
* Practical for small labs to medium clusters
* 100% [free](https://www.debian.org/intro/free) components (both freedom and zero cost)
* Respect for privacy by requiring analytics be opt-in
* [Free](#social-contract) (freedom and cost) and privacy-respecting
* Practical for labs, datacenters, and clouds
## Status
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
Purenetes is [dghubble](https://twitter.com/dghubble)'s personal Kubernetes distribution. It powers his cloud and colocation clusters. While functional, it is not yet suited for the public.
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
## Features
* Kubernetes v1.7.1 with self-hosted control plane via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube)
* Secure etcd with generated TLS certs, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, generated admin kubeconfig
* Multi-master, workload isolation
* Ingress-ready (perhaps include by default)
* Works with your existing Terraform infrastructure and secret management
## Documentation
See [docs.purenetes.org](https://docs.purenetes.org)
* Kubernetes v1.10.2 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/) and [preemption](https://typhoon.psdn.io/google-cloud/#preemption) (varies by platform)
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Modules
Purenetes provides a Terraform Module for each supported operating system and platform.
Typhoon provides a Terraform Module for each supported operating system and platform.
| Platform | Operating System | Terraform Module |
|---------------|------------------|------------------|
| Bare-Metal | Container Linux | bare-metal/container-linux/kubernetes |
| Google Cloud | Container Linux | google-cloud/container-linux/kubernetes |
| Digital Ocean | Container Linux | digital-ocean/container-linux/kubernetes |
| Platform | Operating System | Terraform Module | Status |
|---------------|------------------|------------------|--------|
| AWS | Container Linux | [aws/container-linux/kubernetes](aws/container-linux/kubernetes) | stable |
| AWS | Fedora Atomic | [aws/fedora-atomic/kubernetes](aws/fedora-atomic/kubernetes) | alpha |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal/container-linux/kubernetes) | stable |
| Bare-Metal | Fedora Atomic | [bare-metal/fedora-atomic/kubernetes](bare-metal/fedora-atomic/kubernetes) | alpha |
| Digital Ocean | Container Linux | [digital-ocean/container-linux/kubernetes](digital-ocean/container-linux/kubernetes) | beta |
| Digital Ocean | Fedora Atomic | [digital-ocean/fedora-atomic/kubernetes](digital-ocean/fedora-atomic/kubernetes) | alpha |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud/container-linux/kubernetes) | beta |
| Google Cloud | Fedora Atomic | [google-cloud/fedora-atomic/kubernetes](google-cloud/fedora-atomic/kubernetes) | very alpha |
## Customization
## Documentation
To customize clusters in ways that aren't supported by input variables, fork the repo and make changes to the Terraform module. Stay tuned for improvements to this strategy since it's beneficial to stay close to this upstream.
* [Docs](https://typhoon.psdn.io)
* Architecture [concepts](https://typhoon.psdn.io/architecture/concepts/) and [operating systems](https://typhoon.psdn.io/architecture/operating-systems/)
* Tutorials for [AWS](https://typhoon.psdn.io/cl/aws/), [Bare-Metal](https://typhoon.psdn.io/cl/bare-metal/), [Digital Ocean](https://typhoon.psdn.io/cl/digital-ocean/), and [Google Cloud](https://typhoon.psdn.io/cl/google-cloud/)
To customize lower-level Kubernetes control plane bootstrapping, see the [purenetes/bootkube-terraform](https://github.com/purenetes/bootkube-terraform) Terraform module.
## Usage
## Contributing
Define a Kubernetes cluster by using the Terraform module for your chosen platform and operating system. Here's a minimal example:
Currently, `purenetes` is the author's personal distribution of Kubernetes. It is focused on addressing the author's cluster needs and is not yet accepting sizable contributions. As the project matures, this contributing policy will be changed to reflect those of a community project.
```tf
module "google-cloud-yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=v1.10.2"
providers = {
google = "google.default"
local = "local.default"
null = "null.default"
template = "template.default"
tls = "tls.default"
}
## Social Contract
# Google Cloud
cluster_name = "yavin"
region = "us-central1"
dns_zone = "example.com"
dns_zone_name = "example-zone"
*A formal social contract is being drafted, inspired by the Debian [Social Contract](https://www.debian.org/social_contract).*
# configuration
ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
asset_dir = "/home/user/.secrets/clusters/yavin"
# optional
worker_count = 2
}
```
For now, know that `purenetes` is not a product, trial, or free-tier. It is not run by a company, it does not offer support or services, and it does not accept or make any money. It is not associated with any operating system or cloud platform vendors.
Fetch modules, plan the changes to be made, and apply the changes.
Disclosure: The author works for CoreOS, but that work is kept as separate as possible. Support for Fedora is planned to ensure no one distro is favored and because the author wants it.
```sh
$ terraform init
$ terraform get --update
$ terraform plan
Plan: 37 to add, 0 to change, 0 to destroy.
$ terraform apply
Apply complete! Resources: 37 added, 0 changed, 0 destroyed.
```
In 4-8 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a `yavin.example.com` DNS record to resolve to a network load balancer across controller nodes.
```sh
$ export KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-0.c.example-com.internal Ready 6m v1.10.2
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.10.2
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.10.2
```
List the pods.
```sh
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-1cs8z 2/2 Running 0 6m
kube-system calico-node-d1l5b 2/2 Running 0 6m
kube-system calico-node-sp9ps 2/2 Running 0 6m
kube-system kube-apiserver-zppls 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-gh9kt 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-h90v8 1/1 Running 1 6m
kube-system kube-dns-1187388186-zj5dl 3/3 Running 0 6m
kube-system kube-proxy-117v6 1/1 Running 0 6m
kube-system kube-proxy-9886n 1/1 Running 0 6m
kube-system kube-proxy-njn47 1/1 Running 0 6m
kube-system kube-scheduler-3895335239-5x87r 1/1 Running 0 6m
kube-system kube-scheduler-3895335239-bzrrt 1/1 Running 1 6m
kube-system pod-checkpointer-l6lrt 1/1 Running 0 6m
```
## Non-Goals
* In-place Kubernetes upgrades (instead, deploy blue/green clusters and failover)
Typhoon is strict about minimalism, maturity, and scope. These are not in scope:
* In-place Kubernetes Upgrades
* Adding every possible option
* Openstack or Mesos platforms
## Help
Ask questions on the IRC #typhoon channel on [freenode.net](http://freenode.net/).
## Motivation
Typhoon powers the author's cloud and colocation clusters. The project has evolved through operational experience and Kubernetes changes. Typhoon is shared under a free license to allow others to use the work freely and contribute to its upkeep.
Typhoon addresses real world needs, which you may share. It is honest about limitations or areas that aren't mature yet. It avoids buzzword bingo and hype. It does not aim to be the one-solution-fits-all distro. An ecosystem of Kubernetes distributions is healthy.
## Social Contract
Typhoon is not a product, trial, or free-tier. It is not run by a company, does not offer support or services, and does not accept or make any money. It is not associated with any operating system or platform vendor.
Typhoon clusters will contain only [free](https://www.debian.org/intro/free) components. Cluster components will not collect data on users without their permission.
## Donations
Typhoon does not accept money donations. Instead, we encourage you to donate to one of [these organizations](https://github.com/poseidon/typhoon/wiki/Donations) to show your appreciation.


@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: reboot-coordinator


@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: reboot-coordinator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: reboot-coordinator
subjects:
  - kind: ServiceAccount
    namespace: reboot-coordinator
    name: default


@ -0,0 +1,45 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: reboot-coordinator
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - create
      - get
      - update
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - watch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - delete
  - apiGroups:
      - "extensions"
    resources:
      - daemonsets
    verbs:
      - get


@ -0,0 +1,59 @@
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: container-linux-update-agent
  namespace: reboot-coordinator
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: container-linux-update-agent
  template:
    metadata:
      labels:
        app: container-linux-update-agent
    spec:
      containers:
      - name: update-agent
        image: quay.io/coreos/container-linux-update-operator:v0.6.0
        command:
        - "/bin/update-agent"
        volumeMounts:
          - mountPath: /var/run/dbus
            name: var-run-dbus
          - mountPath: /etc/coreos
            name: etc-coreos
          - mountPath: /usr/share/coreos
            name: usr-share-coreos
          - mountPath: /etc/os-release
            name: etc-os-release
        env:
        # read by update-agent as the node name to manage reboots for
        - name: UPDATE_AGENT_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: var-run-dbus
        hostPath:
          path: /var/run/dbus
      - name: etc-coreos
        hostPath:
          path: /etc/coreos
      - name: usr-share-coreos
        hostPath:
          path: /usr/share/coreos
      - name: etc-os-release
        hostPath:
          path: /etc/os-release


@ -0,0 +1,29 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: container-linux-update-operator
  namespace: reboot-coordinator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: container-linux-update-operator
  template:
    metadata:
      labels:
        app: container-linux-update-operator
    spec:
      containers:
      - name: update-operator
        image: quay.io/coreos/container-linux-update-operator:v0.6.0
        command:
        - "/bin/update-operator"
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule


@ -0,0 +1,15 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-providers
  namespace: monitoring
data:
  dashboard-providers.yaml: |+
    apiVersion: 1
    providers:
    - name: 'default'
      orgId: 1
      folder: ''
      type: file
      options:
        path: /var/lib/grafana/dashboards


@ -0,0 +1,7361 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
data:
deployment-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 1,
"hideControls": false,
"links": [],
"rows": [
{
"collapse": false,
"editable": false,
"height": "200px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 8,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "cores",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m]))",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "CPU",
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 9,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "GB",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "80%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(container_memory_usage_bytes{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}) / 1024^3",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Memory",
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "Bps",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": false
},
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{namespace=\"$deployment_namespace\",pod_name=~\"$deployment_name.*\"}[3m]))",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Network",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "100px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": false
},
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"metric": "kube_deployment_spec_replicas",
"refId": "A",
"step": 600
}
],
"title": "Desired Replicas",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 6,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "min(kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Available Replicas",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 3,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_deployment_status_observed_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Observed Generation",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_deployment_metadata_generation{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Metadata Generation",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "350px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 1,
"isNew": true,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "max(kube_deployment_status_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "current replicas",
"refId": "A",
"step": 30
},
{
"expr": "min(kube_deployment_status_replicas_available{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "available",
"refId": "B",
"step": 30
},
{
"expr": "max(kube_deployment_status_replicas_unavailable{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "unavailable",
"refId": "C",
"step": 30
},
{
"expr": "min(kube_deployment_status_replicas_updated{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "updated",
"refId": "D",
"step": 30
},
{
"expr": "max(kube_deployment_spec_replicas{deployment=\"$deployment_name\",namespace=\"$deployment_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "desired",
"refId": "E",
"step": 30
}
],
"title": "Replicas",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "none",
"label": "",
"logBase": 1,
"show": true
},
{
"format": "short",
"label": "",
"logBase": 1,
"show": false
}
]
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": ".*",
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": "Namespace",
"multi": false,
"name": "deployment_namespace",
"options": [],
"query": "label_values(kube_deployment_metadata_generation, namespace)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": null,
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": "Deployment",
"multi": false,
"name": "deployment_name",
"options": [],
"query": "label_values(kube_deployment_metadata_generation{namespace=\"$deployment_namespace\"}, deployment)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "deployment",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Deployment",
"version": 1
}
etcd-dashboard.json: |+
{
"__inputs": [
{
"name": "prometheus",
"label": "prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "4.5.2"
},
{
"type": "panel",
"id": "graph",
"name": "Graph",
"version": ""
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "singlestat",
"name": "Singlestat",
"version": ""
}
],
"annotations": {
"list": []
},
"description": "etcd sample Grafana dashboard with Prometheus",
"editable": false,
"gnetId": null,
"graphTooltip": 0,
"hideControls": false,
"id": null,
"links": [],
"refresh": false,
"rows": [
{
"collapse": false,
"height": "250px",
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"error": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 28,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum(etcd_server_has_leader)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"metric": "etcd_server_has_leader",
"refId": "A",
"step": 20
}
],
"thresholds": "",
"title": "Up",
"type": "singlestat",
"valueFontSize": "200%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"id": 23,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 5,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(grpc_server_started_total{grpc_type=\"unary\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "RPC Rate",
"metric": "grpc_server_started_total",
"refId": "A",
"step": 4
},
{
"expr": "sum(rate(grpc_server_handled_total{grpc_type=\"unary\",grpc_code!=\"OK\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "RPC Failed Rate",
"metric": "grpc_server_handled_total",
"refId": "B",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "RPC Rate",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "ops",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"id": 41,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 4,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum(grpc_server_started_total{grpc_service=\"etcdserverpb.Watch\",grpc_type=\"bidi_stream\"}) - sum(grpc_server_handled_total{grpc_service=\"etcdserverpb.Watch\",grpc_type=\"bidi_stream\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Watch Streams",
"metric": "grpc_server_handled_total",
"refId": "A",
"step": 4
},
{
"expr": "sum(grpc_server_started_total{grpc_service=\"etcdserverpb.Lease\",grpc_type=\"bidi_stream\"}) - sum(grpc_server_handled_total{grpc_service=\"etcdserverpb.Lease\",grpc_type=\"bidi_stream\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Lease Streams",
"metric": "grpc_server_handled_total",
"refId": "B",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Active Streams",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Row",
"titleSize": "h6"
},
{
"collapse": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"editable": false,
"error": false,
"fill": 0,
"grid": {},
"id": 1,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "etcd_debugging_mvcc_db_total_size_in_bytes",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{instance}} DB Size",
"metric": "",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "DB Size",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": false
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"grid": {},
"id": 3,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 1,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": true,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))",
"format": "time_series",
"hide": false,
"intervalFactor": 2,
"legendFormat": "{{instance}} WAL fsync",
"metric": "etcd_disk_wal_fsync_duration_seconds_bucket",
"refId": "A",
"step": 4
},
{
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} DB fsync",
"metric": "etcd_disk_backend_commit_duration_seconds_bucket",
"refId": "B",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Disk Sync Duration",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": false
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"id": 29,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "process_resident_memory_bytes",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} Resident Memory",
"metric": "process_resident_memory_bytes",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Memory",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "New row",
"titleSize": "h6"
},
{
"collapse": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 5,
"id": 22,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 3,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "rate(etcd_network_client_grpc_received_bytes_total[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} Client Traffic In",
"metric": "etcd_network_client_grpc_received_bytes_total",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Client Traffic In",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 5,
"id": 21,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 3,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "rate(etcd_network_client_grpc_sent_bytes_total[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} Client Traffic Out",
"metric": "etcd_network_client_grpc_sent_bytes_total",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Client Traffic Out",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"id": 20,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} Peer Traffic In",
"metric": "etcd_network_peer_received_bytes_total",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Peer Traffic In",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"editable": false,
"error": false,
"fill": 0,
"grid": {},
"id": 16,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{instance}} Peer Traffic Out",
"metric": "etcd_network_peer_sent_bytes_total",
"refId": "A",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Peer Traffic Out",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "New row",
"titleSize": "h6"
},
{
"collapse": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 0,
"id": 40,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(etcd_server_proposals_failed_total[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Proposal Failure Rate",
"metric": "etcd_server_proposals_failed_total",
"refId": "A",
"step": 2
},
{
"expr": "sum(etcd_server_proposals_pending)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Proposal Pending Total",
"metric": "etcd_server_proposals_pending",
"refId": "B",
"step": 2
},
{
"expr": "sum(rate(etcd_server_proposals_committed_total[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Proposal Commit Rate",
"metric": "etcd_server_proposals_committed_total",
"refId": "C",
"step": 2
},
{
"expr": "sum(rate(etcd_server_proposals_applied_total[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Proposal Apply Rate",
"refId": "D",
"step": 2
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Raft Proposals",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 0,
"editable": false,
"error": false,
"fill": 0,
"id": 19,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": false,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "changes(etcd_server_leader_changes_seen_total[1d])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} Total Leader Elections Per Day",
"metric": "etcd_server_leader_changes_seen_total",
"refId": "A",
"step": 2
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "Total Leader Elections Per Day",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "New row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-15m",
"to": "now"
},
"timepicker": {
"now": true,
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "etcd",
"version": 4
}
kubernetes-capacity-planning-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"gnetId": 22,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"refresh": false,
"rows": [
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 3,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_cpu{mode=\"idle\"}[2m])) * 100",
"hide": false,
"intervalFactor": 10,
"legendFormat": "",
"refId": "A",
"step": 50
}
],
"title": "Idle CPU",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": "idle cpu",
"logBase": 1,
"min": 0,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 9,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(node_load1)",
"intervalFactor": 4,
"legendFormat": "load 1m",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "sum(node_load5)",
"intervalFactor": 4,
"legendFormat": "load 5m",
"refId": "B",
"step": 20,
"target": ""
},
{
"expr": "sum(node_load15)",
"intervalFactor": 4,
"legendFormat": "load 15m",
"refId": "C",
"step": 20,
"target": ""
}
],
"title": "System Load",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 4,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 9,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)",
"intervalFactor": 2,
"legendFormat": "memory usage",
"metric": "memo",
"refId": "A",
"step": 10,
"target": ""
},
{
"expr": "sum(node_memory_Buffers)",
"interval": "",
"intervalFactor": 2,
"legendFormat": "memory buffers",
"metric": "memo",
"refId": "B",
"step": 10,
"target": ""
},
{
"expr": "sum(node_memory_Cached)",
"interval": "",
"intervalFactor": 2,
"legendFormat": "memory cached",
"metric": "memo",
"refId": "C",
"step": 10,
"target": ""
},
{
"expr": "sum(node_memory_MemFree)",
"interval": "",
"intervalFactor": 2,
"legendFormat": "memory free",
"metric": "memo",
"refId": "D",
"step": 10,
"target": ""
}
],
"title": "Memory Usage",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"min": "0",
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100",
"intervalFactor": 2,
"metric": "",
"refId": "A",
"step": 60,
"target": ""
}
],
"thresholds": "80, 90",
"title": "Memory Usage",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "246px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 6,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "read",
"yaxis": 1
},
{
"alias": "{instance=\"172.17.0.1:9100\"}",
"yaxis": 2
},
{
"alias": "io time",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 9,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_disk_bytes_read[5m]))",
"hide": false,
"intervalFactor": 4,
"legendFormat": "read",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "sum(rate(node_disk_bytes_written[5m]))",
"intervalFactor": 4,
"legendFormat": "written",
"refId": "B",
"step": 20
},
{
"expr": "sum(rate(node_disk_io_time_ms[5m]))",
"intervalFactor": 4,
"legendFormat": "io time",
"refId": "C",
"step": 20
}
],
"title": "Disk I/O",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "ms",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percentunit",
"gauge": {
"maxValue": 1,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 12,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(node_filesystem_size{device!=\"rootfs\"}) - sum(node_filesystem_free{device!=\"rootfs\"})) / sum(node_filesystem_size{device!=\"rootfs\"})",
"intervalFactor": 2,
"refId": "A",
"step": 60,
"target": ""
}
],
"thresholds": "0.75, 0.9",
"title": "Disk Space Usage",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 8,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "transmitted",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes{device!~\"lo\"}[5m]))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 10,
"target": ""
}
],
"title": "Network Received",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "bytes",
"logBase": 1,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 10,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "transmitted",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_network_transmit_bytes{device!~\"lo\"}[5m]))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "",
"refId": "B",
"step": 10,
"target": ""
}
],
"title": "Network Transmitted",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "bytes",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "276px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 11,
"isNew": true,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 11,
"span": 9,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(kube_pod_info)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Current number of Pods",
"refId": "A",
"step": 10
},
{
"expr": "sum(kube_node_status_capacity_pods)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Maximum capacity of pods",
"refId": "B",
"step": 10
}
],
"title": "Cluster Pod Utilization",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "100 - (sum(kube_node_status_capacity_pods) - sum(kube_pod_info)) / sum(kube_node_status_capacity_pods) * 100",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 60,
"target": ""
}
],
"thresholds": "80, 90",
"title": "Pod Utilization",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Kubernetes Capacity Planning",
"version": 4
}
kubernetes-cluster-health-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"refresh": "10s",
"rows": [
{
"collapse": false,
"editable": false,
"height": "254px",
"panels": [
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 1,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(up{job=~\"apiserver|kube-scheduler|kube-controller-manager\"} == 0)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Control Plane Components Down",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "Everything UP and healthy",
"value": "null"
},
{
"op": "=",
"text": "",
"value": ""
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(ALERTS{alertstate=\"firing\",alertname!=\"DeadMansSwitch\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Alerts Firing",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 3,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(ALERTS{alertstate=\"pending\",alertname!=\"DeadMansSwitch\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "3, 5",
"title": "Alerts Pending",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 4,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "count(increase(kube_pod_container_status_restarts[1h]) > 5)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Crashlooping Pods",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": false,
"title": "Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"Ready\",status!=\"true\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Node Not Ready",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 6,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"DiskPressure\",status=\"true\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Node Disk Pressure",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Node Memory Pressure",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 8,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(kube_node_spec_unschedulable)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Nodes Unschedulable",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": false,
"title": "Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Kubernetes Cluster Health",
"version": 9
}
kubernetes-cluster-status-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"rows": [
{
"collapse": false,
"editable": false,
"height": "129px",
"panels": [
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 6,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(up{job=~\"apiserver|kube-scheduler|kube-controller-manager\"} == 0)",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Control Plane UP",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "UP",
"value": "null"
}
],
"valueName": "total"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 6,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 6,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(ALERTS{alertstate=\"firing\",alertname!=\"DeadMansSwitch\"})",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "3, 5",
"title": "Alerts Firing",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": true,
"title": "Cluster Health",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "168px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 1,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"apiserver\"} == 1) / count(up{job=\"apiserver\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "API Servers UP",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"kube-controller-manager\"} == 1) / count(up{job=\"kube-controller-manager\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "Controller Managers UP",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 3,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"kube-scheduler\"} == 1) / count(up{job=\"kube-scheduler\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "Schedulers UP",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 4,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "count(increase(kube_pod_container_status_restarts{namespace=~\"kube-system|tectonic-system\"}[1h]) > 5)",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "1, 3",
"title": "Crashlooping Control Plane Pods",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": true,
"title": "Control Plane Status",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "158px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 8,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "sum(100 - (avg by (instance) (rate(node_cpu{job=\"node-exporter\",mode=\"idle\"}[5m])) * 100)) / count(node_cpu{job=\"node-exporter\",mode=\"idle\"})",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "80, 90",
"title": "CPU Utilization",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "((sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)) / sum(node_memory_MemTotal)) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "80, 90",
"title": "Memory Utilization",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 9,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(node_filesystem_size{device!=\"rootfs\"}) - sum(node_filesystem_free{device!=\"rootfs\"})) / sum(node_filesystem_size{device!=\"rootfs\"})",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "80, 90",
"title": "Filesystem Utilization",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 10,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "100 - (sum(kube_node_status_capacity_pods) - sum(kube_pod_info)) / sum(kube_node_status_capacity_pods) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "80, 90",
"title": "Pod Utilization",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": true,
"title": "Capacity Planning",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Kubernetes Cluster Status",
"version": 3
}
kubernetes-control-plane-status-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"rows": [
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 1,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"apiserver\"} == 1) / sum(up{job=\"apiserver\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "API Servers UP",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"kube-controller-manager\"} == 1) / sum(up{job=\"kube-controller-manager\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "Controller Managers UP",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 3,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(up{job=\"kube-scheduler\"} == 1) / sum(up{job=\"kube-scheduler\"})) * 100",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"thresholds": "50, 80",
"title": "Schedulers UP",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 4,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(sum by(instance) (rate(apiserver_request_count{code=~\"5..\"}[5m])) / sum by(instance) (rate(apiserver_request_count[5m]))) * 100",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 600
}
],
"thresholds": "5, 10",
"title": "API Server Request Error Rate",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 7,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by(verb) (rate(apiserver_latency_seconds:quantile[5m]) >= 0)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 30
}
],
"title": "API Server Request Latency",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 5,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "cluster:scheduler_e2e_scheduling_latency_seconds:quantile",
"format": "time_series",
"intervalFactor": 2,
"refId": "A",
"step": 60
}
],
"title": "End to End Scheduling Latency",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "dtdurations",
"logBase": 1,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 6,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by(instance) (rate(apiserver_request_count{code!~\"2..\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Error Rate",
"refId": "A",
"step": 60
},
{
"expr": "sum by(instance) (rate(apiserver_request_count[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Request Rate",
"refId": "B",
"step": 60
}
],
"title": "API Server Request Rates",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Kubernetes Control Plane Status",
"version": 3
}
kubernetes-resource-requests-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"refresh": false,
"rows": [
{
"collapse": false,
"editable": false,
"height": "300px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"description": "This represents the total [CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) in the cluster.\nFor comparison the total [allocatable CPU cores](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 1,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 9,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "min(sum(kube_node_status_allocatable_cpu_cores) by (instance))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "Allocatable CPU Cores",
"refId": "A",
"step": 20
},
{
"expr": "max(sum(kube_pod_container_resource_requests_cpu_cores) by (instance))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "Requested CPU Cores",
"refId": "B",
"step": 20
}
],
"title": "CPU Cores",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "CPU Cores",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "max(sum(kube_pod_container_resource_requests_cpu_cores) by (instance)) / min(sum(kube_node_status_allocatable_cpu_cores) by (instance)) * 100",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 240
}
],
"thresholds": "80, 90",
"title": "CPU Cores",
"transparent": false,
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "CPU Cores",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "300px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"description": "This represents the total [memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) in the cluster.\nFor comparison the total [allocatable memory](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 3,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 9,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "min(sum(kube_node_status_allocatable_memory_bytes) by (instance))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "Allocatable Memory",
"refId": "A",
"step": 20
},
{
"expr": "max(sum(kube_pod_container_resource_requests_memory_bytes) by (instance))",
"hide": false,
"intervalFactor": 2,
"legendFormat": "Requested Memory",
"refId": "B",
"step": 20
}
],
"title": "Memory",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": "Memory",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 4,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "max(sum(kube_pod_container_resource_requests_memory_bytes) by (instance)) / min(sum(kube_node_status_allocatable_memory_bytes) by (instance)) * 100",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A",
"step": 240
}
],
"thresholds": "80, 90",
"title": "Memory",
"transparent": false,
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Memory",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-3h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Kubernetes Resource Requests",
"version": 2
}
nodes-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"description": "Dashboard to get an overview of one server",
"editable": false,
"gnetId": 22,
"graphTooltip": 0,
"hideControls": false,
"links": [],
"refresh": false,
"rows": [
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 3,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "100 - (avg by (cpu) (irate(node_cpu{mode=\"idle\", instance=\"$server\"}[5m])) * 100)",
"hide": false,
"intervalFactor": 10,
"legendFormat": "{{cpu}}",
"refId": "A",
"step": 50
}
],
"title": "Idle CPU",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": "cpu usage",
"logBase": 1,
"max": 100,
"min": 0,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 9,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_load1{instance=\"$server\"}",
"intervalFactor": 4,
"legendFormat": "load 1m",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "node_load5{instance=\"$server\"}",
"intervalFactor": 4,
"legendFormat": "load 5m",
"refId": "B",
"step": 20,
"target": ""
},
{
"expr": "node_load15{instance=\"$server\"}",
"intervalFactor": 4,
"legendFormat": "load 15m",
"refId": "C",
"step": 20,
"target": ""
}
],
"title": "System Load",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "percentunit",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 4,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "node_memory_SwapFree{instance=\"172.17.0.1:9100\",job=\"prometheus\"}",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 9,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "memory used",
"metric": "",
"refId": "C",
"step": 10
},
{
"expr": "node_memory_Buffers{instance=\"$server\"}",
"interval": "",
"intervalFactor": 2,
"legendFormat": "memory buffers",
"metric": "",
"refId": "E",
"step": 10
},
{
"expr": "node_memory_Cached{instance=\"$server\"}",
"intervalFactor": 2,
"legendFormat": "memory cached",
"metric": "",
"refId": "F",
"step": 10
},
{
"expr": "node_memory_MemFree{instance=\"$server\"}",
"intervalFactor": 2,
"legendFormat": "memory free",
"metric": "",
"refId": "D",
"step": 10
}
],
"title": "Memory Usage",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"min": "0",
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "((node_memory_MemTotal{instance=\"$server\"} - node_memory_MemFree{instance=\"$server\"} - node_memory_Buffers{instance=\"$server\"} - node_memory_Cached{instance=\"$server\"}) / node_memory_MemTotal{instance=\"$server\"}) * 100",
"intervalFactor": 2,
"refId": "A",
"step": 60,
"target": ""
}
],
"thresholds": "80, 90",
"title": "Memory Usage",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 6,
"isNew": true,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "read",
"yaxis": 1
},
{
"alias": "{instance=\"172.17.0.1:9100\"}",
"yaxis": 2
},
{
"alias": "io time",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 9,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by (instance) (rate(node_disk_bytes_read{instance=\"$server\"}[2m]))",
"hide": false,
"intervalFactor": 4,
"legendFormat": "read",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "sum by (instance) (rate(node_disk_bytes_written{instance=\"$server\"}[2m]))",
"intervalFactor": 4,
"legendFormat": "written",
"refId": "B",
"step": 20
},
{
"expr": "sum by (instance) (rate(node_disk_io_time_ms{instance=\"$server\"}[2m]))",
"intervalFactor": 4,
"legendFormat": "io time",
"refId": "C",
"step": 20
}
],
"title": "Disk I/O",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "ms",
"logBase": 1,
"show": true
}
]
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": false,
"format": "percentunit",
"gauge": {
"maxValue": 1,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"hideTimeOverride": false,
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "(sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"}) - sum(node_filesystem_free{device!=\"rootfs\",instance=\"$server\"})) / sum(node_filesystem_size{device!=\"rootfs\",instance=\"$server\"})",
"intervalFactor": 2,
"refId": "A",
"step": 60,
"target": ""
}
],
"thresholds": "0.75, 0.9",
"title": "Disk Space Usage",
"transparent": false,
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 8,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "transmitted",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "rate(node_network_receive_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
"hide": false,
"intervalFactor": 2,
"legendFormat": "{{device}}",
"refId": "A",
"step": 10,
"target": ""
}
],
"title": "Network Received",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "bytes",
"logBase": 1,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 10,
"isNew": false,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "transmitted",
"yaxis": 2
}
],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "rate(node_network_transmit_bytes{instance=\"$server\",device!~\"lo\"}[5m])",
"hide": false,
"intervalFactor": 2,
"legendFormat": "{{device}}",
"refId": "B",
"step": 10,
"target": ""
}
],
"title": "Network Transmitted",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "bytes",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": null,
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": null,
"multi": false,
"name": "server",
"options": [],
"query": "label_values(node_boot_time, instance)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Nodes",
"version": 2
}
pods-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 1,
"hideControls": false,
"links": [],
"refresh": false,
"rows": [
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 1,
"isNew": false,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by(container_name) (container_memory_usage_bytes{pod_name=\"$pod\", container_name=~\"$container\", container_name!=\"POD\"})",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "Current: {{ container_name }}",
"metric": "container_memory_usage_bytes",
"refId": "A",
"step": 15
},
{
"expr": "kube_pod_container_resource_requests_memory_bytes{pod=\"$pod\", container=~\"$container\"}",
"interval": "10s",
"intervalFactor": 2,
"legendFormat": "Requested: {{ container }}",
"metric": "kube_pod_container_resource_requests_memory_bytes",
"refId": "B",
"step": 20
},
{
"expr": "kube_pod_container_resource_limits_memory_bytes{pod=\"$pod\", container=~\"$container\"}",
"interval": "10s",
"intervalFactor": 2,
"legendFormat": "Limit: {{ container }}",
"metric": "kube_pod_container_resource_limits_memory_bytes",
"refId": "C",
"step": 20
}
],
"title": "Memory Usage",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 2,
"isNew": false,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by (container_name)(rate(container_cpu_usage_seconds_total{image!=\"\",container_name!=\"POD\",pod_name=\"$pod\"}[1m]))",
"intervalFactor": 2,
"legendFormat": "{{ container_name }}",
"refId": "A",
"step": 30
},
{
"expr": "kube_pod_container_resource_requests_cpu_cores{pod=\"$pod\", container=~\"$container\"}",
"interval": "10s",
"intervalFactor": 2,
"legendFormat": "Requested: {{ container }}",
"metric": "kube_pod_container_resource_requests_cpu_cores",
"refId": "B",
"step": 20
},
{
"expr": "kube_pod_container_resource_limits_cpu_cores{pod=\"$pod\", container=~\"$container\"}",
"interval": "10s",
"intervalFactor": 2,
"legendFormat": "Limit: {{ container }}",
"metric": "kube_pod_container_resource_limits_cpu_cores",
"refId": "C",
"step": 20
}
],
"title": "CPU Usage",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "250px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 3,
"isNew": false,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sort_desc(sum by (pod_name) (rate(container_network_receive_bytes_total{pod_name=\"$pod\"}[1m])))",
"intervalFactor": 2,
"legendFormat": "{{ pod_name }}",
"refId": "A",
"step": 30
}
],
"title": "Network I/O",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"logBase": 1,
"show": true
},
{
"format": "short",
"logBase": 1,
"show": true
}
]
}
],
"showTitle": false,
"title": "New Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": ".*",
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": true,
"label": "Namespace",
"multi": false,
"name": "namespace",
"options": [],
"query": "label_values(kube_pod_info, namespace)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": "Pod",
"multi": false,
"name": "pod",
"options": [],
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": ".*",
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": true,
"label": "Container",
"multi": false,
"name": "container",
"options": [],
"query": "label_values(kube_pod_container_info{namespace=~\"$namespace\", pod=\"$pod\"}, container)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Pods",
"version": 1
}
statefulset-dashboard.json: |+
{
"__inputs": [
{
"description": "",
"label": "prometheus",
"name": "prometheus",
"pluginId": "prometheus",
"pluginName": "Prometheus",
"type": "datasource"
}
],
"annotations": {
"list": []
},
"editable": false,
"graphTooltip": 1,
"hideControls": false,
"links": [],
"rows": [
{
"collapse": false,
"editable": false,
"height": "200px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 8,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "cores",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$statefulset_namespace\",pod_name=~\"$statefulset_name.*\"}[3m]))",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "CPU",
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 9,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "GB",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "80%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(container_memory_usage_bytes{namespace=\"$statefulset_namespace\",pod_name=~\"$statefulset_name.*\"}) / 1024^3",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Memory",
"type": "singlestat",
"valueFontSize": "110%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "Bps",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": false
},
"id": 7,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"targets": [
{
"expr": "sum(rate(container_network_transmit_bytes_total{namespace=\"$statefulset_namespace\",pod_name=~\"$statefulset_name.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{namespace=\"$statefulset_namespace\",pod_name=~\"$statefulset_name.*\"}[3m]))",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Network",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "100px",
"panels": [
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": false
},
"id": 5,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_statefulset_replicas{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"metric": "kube_statefulset_replicas",
"refId": "A",
"step": 600
}
],
"title": "Desired Replicas",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 6,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "min(kube_statefulset_status_replicas{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Available Replicas",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 3,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_statefulset_status_observed_generation{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Observed Generation",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"editable": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 2,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"targets": [
{
"expr": "max(kube_statefulset_metadata_generation{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"refId": "A",
"step": 600
}
],
"title": "Metadata Generation",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"editable": false,
"height": "350px",
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": false,
"error": false,
"fill": 1,
"grid": {
"threshold1Color": "rgba(216, 200, 27, 0.27)",
"threshold2Color": "rgba(234, 112, 112, 0.22)"
},
"id": 1,
"isNew": true,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 12,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "min(kube_statefulset_status_replicas{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "available",
"refId": "B",
"step": 30
},
{
"expr": "max(kube_statefulset_replicas{statefulset=\"$statefulset_name\",namespace=\"$statefulset_namespace\"}) without (instance, pod)",
"intervalFactor": 2,
"legendFormat": "desired",
"refId": "E",
"step": 30
}
],
"title": "Replicas",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"mode": "time",
"show": true,
"values": []
},
"yaxes": [
{
"format": "none",
"label": "",
"logBase": 1,
"show": true
},
{
"format": "short",
"label": "",
"logBase": 1,
"show": false
}
]
}
],
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"sharedCrosshair": false,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": ".*",
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": "Namespace",
"multi": false,
"name": "statefulset_namespace",
"options": [],
"query": "label_values(kube_statefulset_metadata_generation, namespace)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": null,
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {},
"datasource": "prometheus",
"hide": 0,
"includeAll": false,
"label": "StatefulSet",
"multi": false,
"name": "statefulset_name",
"options": [],
"query": "label_values(kube_statefulset_metadata_generation{namespace=\"$statefulset_namespace\"}, statefulset)",
"refresh": 1,
"regex": "",
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "statefulset",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "StatefulSet",
"version": 1
}
---


@ -0,0 +1,16 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: monitoring
data:
prometheus.yaml: |+
apiVersion: 1
datasources:
- name: prometheus
type: prometheus
access: proxy
orgId: 1
url: http://prometheus.monitoring.svc.cluster.local
version: 1
editable: false


@ -0,0 +1,60 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: grafana
phase: prod
template:
metadata:
labels:
name: grafana
phase: prod
spec:
containers:
- name: grafana
image: grafana/grafana:5.0.4
env:
- name: GF_SERVER_HTTP_PORT
value: "8080"
- name: GF_AUTH_BASIC_ENABLED
value: "false"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ORG_ROLE
value: Viewer
ports:
- name: http
containerPort: 8080
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 200Mi
cpu: 200m
volumeMounts:
- name: datasources
mountPath: /etc/grafana/provisioning/datasources
- name: dashboard-providers
mountPath: /etc/grafana/provisioning/dashboards
- name: dashboards
mountPath: /var/lib/grafana/dashboards
volumes:
- name: datasources
configMap:
name: grafana-datasources
- name: dashboard-providers
configMap:
name: grafana-dashboard-providers
- name: dashboards
configMap:
name: grafana-dashboards


@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
spec:
type: ClusterIP
selector:
name: grafana
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080


@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: heapster
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:heapster
subjects:
- kind: ServiceAccount
name: heapster
namespace: kube-system


@ -0,0 +1,60 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: heapster
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
name: heapster
phase: prod
template:
metadata:
labels:
name: heapster
phase: prod
spec:
serviceAccountName: heapster
containers:
- name: heapster
image: k8s.gcr.io/heapster-amd64:v1.5.2
command:
- /heapster
- --source=kubernetes.summary_api:''
livenessProbe:
httpGet:
path: /healthz
port: 8082
scheme: HTTP
initialDelaySeconds: 180
timeoutSeconds: 5
- name: heapster-nanny
image: k8s.gcr.io/addon-resizer:1.7
command:
- /pod_nanny
- --cpu=80m
- --extra-cpu=0.5m
- --memory=140Mi
- --extra-memory=4Mi
- --threshold=5
- --deployment=heapster
- --container=heapster
- --poll-period=300000
- --estimator=exponential
env:
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
cpu: 50m
memory: 90Mi
requests:
cpu: 50m
memory: 90Mi
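
The `heapster-nanny` flags above describe a base allocation plus a per-node increment. A minimal sketch of that arithmetic, assuming a plain linear model (base + extra × nodes); the function names are hypothetical, the node-count bucketing that `--estimator=exponential` may apply is not modelled, and reading `--threshold=5` as a percentage deviation is an assumption:

```python
def nanny_estimate(nodes, base_cpu_m=80.0, extra_cpu_m=0.5,
                   base_mem_mi=140.0, extra_mem_mi=4.0):
    """Return (cpu in millicores, memory in Mi) for a cluster of `nodes` nodes.

    Defaults mirror --cpu=80m, --extra-cpu=0.5m, --memory=140Mi and
    --extra-memory=4Mi from the manifest. Linear model only (assumption).
    """
    cpu_m = base_cpu_m + extra_cpu_m * nodes
    mem_mi = base_mem_mi + extra_mem_mi * nodes
    return cpu_m, mem_mi


def needs_update(current, expected, threshold_pct=5.0):
    """True when `current` deviates from `expected` by more than threshold_pct.

    Assumed semantics for --threshold=5; not taken from the addon-resizer source.
    """
    deviation_pct = abs(current - expected) / expected * 100.0
    return deviation_pct > threshold_pct


cpu_m, mem_mi = nanny_estimate(nodes=10)
print(f"10 nodes: {cpu_m}m CPU, {mem_mi}Mi memory")
```

Under these assumptions a 10-node cluster would give the `heapster` container 85m CPU and 180Mi memory, and a container currently at 80m would be resized because it is more than 5% away from the estimate.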


@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: heapster
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: system:pod-nanny
subjects:
- kind: ServiceAccount
name: heapster
namespace: kube-system

addons/heapster/role.yaml

@ -0,0 +1,19 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: system:pod-nanny
namespace: kube-system
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- apiGroups:
- "extensions"
resources:
- deployments
verbs:
- get
- update


@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: heapster
namespace: kube-system


@ -0,0 +1,12 @@
apiVersion: v1
kind: Service
metadata:
name: heapster
namespace: kube-system
spec:
type: ClusterIP
selector:
name: heapster
ports:
- port: 80
targetPort: 8082


@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: ingress


@ -0,0 +1,40 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
spec:
containers:
- name: default-backend
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60
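
The comment in this manifest states the two requirements for a default backend image: a 404 page at `/` and a 200 on `/healthz`. A minimal sketch of a server meeting that contract, using only the Python standard library; `DefaultBackendHandler` and `make_server` are hypothetical names, and this is not the implementation inside `k8s.gcr.io/defaultbackend`:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class DefaultBackendHandler(BaseHTTPRequestHandler):
    """Answer /healthz with 200 and every other path with 404."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"default backend - 404"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the sketch quiet.
        pass


def make_server(port=8080):
    """Build a server on all interfaces; 8080 matches containerPort above."""
    return HTTPServer(("", port), DefaultBackendHandler)
```

Calling `make_server().serve_forever()` would start it on port 8080, the port the liveness probe in the manifest targets.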


@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080


@ -0,0 +1,71 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
replicas: 2
strategy:
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: nginx-ingress-controller
phase: prod
template:
metadata:
labels:
name: nginx-ingress-controller
phase: prod
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
hostPort: 80
- name: https
containerPort: 443
hostPort: 443
- name: health
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60


@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default


@ -0,0 +1,51 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- endpoints
- nodes
- pods
- secrets
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- apiGroups:
- ""
resources:
- services
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- "extensions"
resources:
- ingresses/status
verbs:
- update


@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ingress
namespace: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default


@ -0,0 +1,41 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ingress
namespace: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- pods
- secrets
verbs:
- get
- apiGroups:
- ""
resources:
- configmaps
resourceNames:
# Defaults to "<election-id>-<ingress-class>"
# Here: "<ingress-controller-leader>-<public>"
# This has to be adapted if you change either parameter
# when launching the nginx-ingress-controller.
- "ingress-controller-leader-public"
verbs:
- get
- update
- apiGroups:
- ""
resources:
- configmaps
verbs:
- create
- apiGroups:
- ""
resources:
- endpoints
verbs:
- get
- create
- update
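
The `resourceNames` comment in this Role says the leader-election ConfigMap is named `<election-id>-<ingress-class>` and must be adapted when either parameter changes. A small sketch of that naming rule and of a check against a Role's `resourceNames`; the helper names are hypothetical, and the default election id is taken from the comment in the manifest:

```python
def leader_configmap_name(ingress_class,
                          election_id="ingress-controller-leader"):
    """Compose the ConfigMap name as "<election-id>-<ingress-class>"."""
    return f"{election_id}-{ingress_class}"


def role_covers_leader_configmap(resource_names, ingress_class,
                                 election_id="ingress-controller-leader"):
    """True if the Role's resourceNames include the name the controller will use."""
    return leader_configmap_name(ingress_class, election_id) in resource_names


# The controller above runs with --ingress-class=public.
name = leader_configmap_name("public")
print(name)
print(role_covers_leader_configmap(["ingress-controller-leader-public"], "public"))
```

With `--ingress-class=public` the composed name matches the `ingress-controller-leader-public` entry in the Role; a Role that still listed `ingress-controller-leader-nginx` would fail the check.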


@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
type: ClusterIP
selector:
name: nginx-ingress-controller
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 80
- name: https
protocol: TCP
port: 443
targetPort: 443


@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: ingress


@ -0,0 +1,71 @@
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: nginx-ingress-controller
phase: prod
template:
metadata:
labels:
name: nginx-ingress-controller
phase: prod
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
hostPort: 80
- name: https
containerPort: 443
hostPort: 443
- name: health
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60


@ -0,0 +1,40 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
spec:
containers:
- name: default-backend
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60


@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080


@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default


@ -0,0 +1,51 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- endpoints
- nodes
- pods
- secrets
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- apiGroups:
- ""
resources:
- services
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- "extensions"
resources:
- ingresses/status
verbs:
- update


@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ingress
namespace: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default


@ -0,0 +1,41 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ingress
namespace: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- pods
- secrets
verbs:
- get
- apiGroups:
- ""
resources:
- configmaps
resourceNames:
# Defaults to "<election-id>-<ingress-class>"
# Here: "<ingress-controller-leader>-<public>"
# This has to be adapted if you change either parameter
# when launching the nginx-ingress-controller.
- "ingress-controller-leader-public"
verbs:
- get
- update
- apiGroups:
- ""
resources:
- configmaps
verbs:
- create
- apiGroups:
- ""
resources:
- endpoints
verbs:
- get
- create
- update


@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
type: ClusterIP
selector:
name: nginx-ingress-controller
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 80
- name: https
protocol: TCP
port: 443
targetPort: 443


@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: ingress


@ -0,0 +1,40 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
spec:
containers:
- name: default-backend
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

@ -0,0 +1,71 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
replicas: 2
strategy:
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: nginx-ingress-controller
phase: prod
template:
metadata:
labels:
name: nginx-ingress-controller
phase: prod
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
hostPort: 80
- name: https
containerPort: 443
hostPort: 443
- name: health
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60

@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default

@ -0,0 +1,51 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- endpoints
- nodes
- pods
- secrets
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- apiGroups:
- ""
resources:
- services
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- "extensions"
resources:
- ingresses/status
verbs:
- update

@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ingress
namespace: ingress
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ingress
subjects:
- kind: ServiceAccount
namespace: ingress
name: default

@ -0,0 +1,41 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ingress
namespace: ingress
rules:
- apiGroups:
- ""
resources:
- configmaps
- pods
- secrets
verbs:
- get
- apiGroups:
- ""
resources:
- configmaps
resourceNames:
# Defaults to "<election-id>-<ingress-class>"
# Here: "<ingress-controller-leader>-<public>"
# This has to be adapted if you change either parameter
# when launching the nginx-ingress-controller.
- "ingress-controller-leader-public"
verbs:
- get
- update
- apiGroups:
- ""
resources:
- configmaps
verbs:
- create
- apiGroups:
- ""
resources:
- endpoints
verbs:
- get
- create
- update

@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: nginx-ingress-controller
namespace: ingress
spec:
type: ClusterIP
selector:
name: nginx-ingress-controller
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 80
- name: https
protocol: TCP
port: 443
targetPort: 443

@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: monitoring

@ -0,0 +1,245 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yaml: |-
# Global config
global:
scrape_interval: 15s
# AlertManager
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Scrape configs for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.
scrape_configs:
# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Using endpoints to discover kube-apiserver targets finds the pod IP
# (host IP since apiserver uses host network) which is not used in
# the server certificate.
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# Keep only the default/kubernetes service endpoints for the https port. This
# adds a target for each API server for which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- replacement: apiserver
action: replace
target_label: job
# Scrape config for node (i.e. kubelet) /metrics (e.g. 'kubelet_'). Explore
# metrics from a node by scraping kubelet (127.0.0.1:10255/metrics).
#
# Rather than connecting directly to the node, the scrape is proxied through the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape config for Kubelet cAdvisor. Explore metrics from a node by
# scraping kubelet (127.0.0.1:10255/metrics/cadvisor).
#
# This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
# (those whose names begin with 'container_') have been removed from the
# Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
# retrieve those metrics.
#
# Rather than connecting directly to the node, the scrape is proxied through the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Scrape etcd metrics from controllers
- job_name: 'etcd'
kubernetes_sd_configs:
- role: node
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_node_label_node_role_kubernetes_io_controller]
action: keep
regex: 'true'
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
action: replace
target_label: __address__
replacement: '${1}:2381'
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
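#
# A minimal sketch of a Service that opts in via the annotations above,
# assuming a hypothetical app named `example-app` that serves metrics on
# port 9102 at /stats/metrics (all three values are illustrative and not
# part of this config). Annotation values must be strings, hence the quotes:
#
#   apiVersion: v1
#   kind: Service
#   metadata:
#     name: example-app
#     annotations:
#       prometheus.io/scrape: 'true'
#       prometheus.io/path: /stats/metrics
#       prometheus.io/port: '9102'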
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: job
# Example scrape config for probing services via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe services that have a value of `true`
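#
# A minimal sketch of Service metadata that opts in to probing, assuming a
# hypothetical Service named `example-web` (illustrative only). Probes are
# sent to the `blackbox` address configured in the relabeling of this job:
#
#   metadata:
#     name: example-web
#     annotations:
#       prometheus.io/probe: 'true'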
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: job
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
# pod's declared ports (default is a port-free target if none are declared).
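#
# A minimal sketch of pod template metadata that opts in, assuming a
# hypothetical metrics port of 9102 (illustrative only, not part of this
# config). Annotation values must be strings, hence the quotes:
#
#   template:
#     metadata:
#       annotations:
#         prometheus.io/scrape: 'true'
#         prometheus.io/port: '9102'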
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Rule files
rule_files:
- "/etc/prometheus/rules/*.rules"
- "/etc/prometheus/rules/*.yaml"
- "/etc/prometheus/rules/*.yml"

@ -0,0 +1,45 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
name: prometheus
phase: prod
template:
metadata:
labels:
name: prometheus
phase: prod
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: quay.io/prometheus/prometheus:v2.2.1
args:
- '--config.file=/etc/prometheus/prometheus.yaml'
ports:
- name: web
containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: rules
mountPath: /etc/prometheus/rules
- name: data
mountPath: /var/lib/prometheus
dnsPolicy: ClusterFirst
restartPolicy: Always
terminationGracePeriodSeconds: 30
volumes:
- name: config
configMap:
name: prometheus-config
- name: rules
configMap:
name: prometheus-rules
- name: data
emptyDir: {}

@ -0,0 +1,18 @@
apiVersion: v1
kind: Service
metadata:
name: kube-controller-manager
namespace: kube-system
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
k8s-app: kube-controller-manager
ports:
- name: metrics
protocol: TCP
port: 10252
targetPort: 10252

@ -0,0 +1,18 @@
apiVersion: v1
kind: Service
metadata:
name: kube-scheduler
namespace: kube-system
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
k8s-app: kube-scheduler
ports:
- name: metrics
protocol: TCP
port: 10251
targetPort: 10251

@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring

@ -0,0 +1,39 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["extensions"]
resources:
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
verbs: ["list", "watch"]
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
resources:
- horizontalpodautoscalers
verbs: ["list", "watch"]

@ -0,0 +1,61 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: kube-state-metrics
phase: prod
template:
metadata:
labels:
name: kube-state-metrics
phase: prod
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: quay.io/coreos/kube-state-metrics:v1.3.1
ports:
- name: metrics
containerPort: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
- name: addon-resizer
image: k8s.gcr.io/addon-resizer:1.7
resources:
limits:
cpu: 100m
memory: 30Mi
requests:
cpu: 100m
memory: 30Mi
env:
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
command:
- /pod_nanny
- --container=kube-state-metrics
- --cpu=100m
- --extra-cpu=1m
- --memory=100Mi
- --extra-memory=2Mi
- --threshold=5
- --deployment=kube-state-metrics

@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: kube-state-metrics
namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring

@ -0,0 +1,15 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: kube-state-metrics-resizer
namespace: monitoring
rules:
- apiGroups: [""]
resources:
- pods
verbs: ["get"]
- apiGroups: ["extensions"]
resources:
- deployments
resourceNames: ["kube-state-metrics"]
verbs: ["get", "update"]

@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: monitoring

@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
name: kube-state-metrics
phase: prod
ports:
- name: metrics
protocol: TCP
port: 8080
targetPort: 8080

@ -0,0 +1,60 @@
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: node-exporter
phase: prod
template:
metadata:
labels:
name: node-exporter
phase: prod
spec:
serviceAccountName: node-exporter
securityContext:
runAsNonRoot: true
runAsUser: 65534
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v0.15.2
args:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
ports:
- name: metrics
containerPort: 9100
hostPort: 9100
resources:
requests:
memory: 30Mi
cpu: 100m
limits:
memory: 50Mi
cpu: 200m
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
tolerations:
- effect: NoSchedule
operator: Exists
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys

@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: node-exporter
namespace: monitoring

@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
name: node-exporter
phase: prod
ports:
- name: metrics
protocol: TCP
port: 80
targetPort: 9100

@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring

@ -0,0 +1,15 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]

@ -0,0 +1,578 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
alertmanager.rules.yaml: |
groups:
- name: alertmanager.rules
rules:
- alert: AlertmanagerConfigInconsistent
expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service)
GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service",
"alertmanager-$1", "alertmanager", "(.*)") != 1
for: 5m
labels:
severity: critical
annotations:
description: The configurations of the instances of the Alertmanager cluster
`{{$labels.service}}` are out of sync.
- alert: AlertmanagerDownOrMissing
expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1",
"alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1
for: 5m
labels:
severity: warning
annotations:
description: An unexpected number of Alertmanagers are scraped or Alertmanagers
disappeared from discovery.
- alert: AlertmanagerFailedReload
expr: alertmanager_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace
}}/{{ $labels.pod}}.
etcd3.rules.yaml: |
groups:
- name: ./etcd3.rules
rules:
- alert: InsufficientMembers
expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
for: 3m
labels:
severity: critical
annotations:
description: If one more etcd member goes down the cluster will be unavailable
summary: etcd cluster insufficient members
- alert: NoLeader
expr: etcd_server_has_leader{job="etcd"} == 0
for: 1m
labels:
severity: critical
annotations:
description: etcd member {{ $labels.instance }} has no leader
summary: etcd member has no leader
- alert: HighNumberOfLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader
changes within the last hour
summary: a high number of leader changes within the etcd cluster are happening
- alert: GRPCRequestsSlow
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le))
> 0.15
for: 10m
labels:
severity: critical
annotations:
description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method
}} are slow
summary: slow gRPC requests
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.05
for: 5m
labels:
severity: critical
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HTTPRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
}} are slow
summary: slow HTTP requests
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} member communication with
{{ $labels.To }} is slow
summary: etcd member communication is slow
- alert: HighNumberOfFailedProposals
expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
failures within the last hour
summary: a high number of proposals within the etcd cluster are failing
- alert: HighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
> 0.5
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} fsync durations are high
summary: high fsync durations
- alert: HighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
> 0.25
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} commit durations are high
summary: high commit durations
general.rules.yaml: |
groups:
- name: general.rules
rules:
- alert: TargetDown
expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of {{ $labels.job }} targets are down.'
summary: Targets are down
- record: fd_utilization
expr: process_open_fds / process_max_fds
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1
for: 10m
labels:
severity: warning
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust its file/socket descriptors within the next 4 hours'
summary: file descriptors soon exhausted
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[10m], 3600) > 1
for: 10m
labels:
severity: critical
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust its file/socket descriptors within the next hour'
summary: file descriptors soon exhausted
kube-controller-manager.rules.yaml: |
groups:
- name: kube-controller-manager.rules
rules:
- alert: K8SControllerManagerDown
expr: absent(up{job="kube-controller-manager"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S controller manager. Deployments and replication
controllers are not making progress.
summary: Controller manager is down
kube-scheduler.rules.yaml: |
groups:
- name: kube-scheduler.rules
rules:
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- alert: K8SSchedulerDown
expr: absent(up{job="kube-scheduler"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S scheduler. New pods are not being assigned
to nodes.
summary: Scheduler is down
kube-state-metrics.rules.yaml: |
groups:
- name: kube-state-metrics.rules
rules:
- alert: DeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
description: Observed deployment generation does not match expected one for
deployment {{$labels.namespace}}/{{$labels.deployment}}
summary: Deployment is outdated
- alert: DeploymentReplicasNotUpdated
expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
unless (kube_deployment_spec_paused == 1)
for: 15m
labels:
severity: warning
annotations:
description: Replicas are not updated and available for deployment {{$labels.namespace}}/{{$labels.deployment}}
summary: Deployment replicas are outdated
- alert: DaemonSetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled
* 100 < 100
for: 15m
labels:
severity: warning
annotations:
description: Only {{$value}}% of desired pods scheduled and ready for daemon
set {{$labels.namespace}}/{{$labels.daemonset}}
summary: DaemonSet is missing pods
- alert: K8SDaemonSetsNotScheduled
expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
> 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are not scheduled.
summary: Daemonsets are not scheduled correctly
- alert: DaemonSetsMissScheduled
expr: kube_daemonset_status_number_misscheduled > 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are running where they are not supposed
to run.
summary: Daemonsets are not scheduled correctly
- alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
description: Pod {{$labels.namespace}}/{{$labels.pod}} was restarted {{$value}}
times within the last hour
summary: Pod is restarting frequently
kubelet.rules.yaml: |
groups:
- name: kubelet.rules
rules:
- alert: K8SNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1h
labels:
severity: warning
annotations:
description: The Kubelet on {{ $labels.node }} has not checked in with the API,
or has set itself to NotReady, for more than an hour
summary: Node status is NotReady
- alert: K8SManyNodesNotReady
expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
> 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
for: 1m
labels:
severity: critical
annotations:
description: '{{ $value }}% of Kubernetes nodes are not ready'
- alert: K8SKubeletDown
expr: count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
for: 1h
labels:
severity: warning
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets.
- alert: K8SKubeletDown
expr: (absent(up{job="kubelet"} == 1) or count(up{job="kubelet"} == 0) / count(up{job="kubelet"}))
* 100 > 10
for: 1h
labels:
severity: critical
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets, or all Kubelets
have disappeared from service discovery.
summary: Many Kubelets cannot be scraped
- alert: K8SKubeletTooManyPods
expr: kubelet_running_pod_count > 100
for: 10m
labels:
severity: warning
annotations:
description: Kubelet {{$labels.instance}} is running {{$value}} pods, close
to the limit of 110
summary: Kubelet is close to pod limit
kubernetes.rules.yaml: |
groups:
- name: kubernetes.rules
rules:
- record: pod_name:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(pod_name)
- record: pod_name:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: pod_name:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
BY (pod_name)
- record: pod_name:container_fs_usage_bytes:sum
expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: namespace:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace)
- record: namespace:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace)
- record: namespace:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m]))
BY (namespace)
- record: cluster:memory_usage:ratio
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(cluster) / sum(machine_memory_bytes) BY (cluster)
- record: cluster:container_spec_cpu_shares:ratio
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000
/ sum(machine_cpu_cores)
- record: cluster:container_cpu_usage:ratio
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
/ sum(machine_cpu_cores)
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.99"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.9"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.5"
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 1
for: 10m
labels:
severity: warning
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 4
for: 10m
labels:
severity: critical
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 2
for: 10m
labels:
severity: warning
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 5
for: 10m
labels:
severity: critical
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: K8SApiserverDown
expr: absent(up{job="apiserver"} == 1)
for: 20m
labels:
severity: critical
annotations:
description: No API servers are reachable or all have disappeared from service
discovery
- alert: K8sCertificateExpirationNotice
labels:
severity: warning
annotations:
description: Kubernetes API Certificate is expiring soon (less than 7 days)
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="604800"}) > 0
- alert: K8sCertificateExpirationNotice
labels:
severity: critical
annotations:
description: Kubernetes API Certificate is expiring in less than 1 day
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="86400"}) > 0
node.rules.yaml: |
groups:
- name: node.rules
rules:
- record: instance:node_cpu:rate:sum
expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
BY (instance)
- record: instance:node_filesystem_usage:sum
expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
BY (instance)
- record: instance:node_network_receive_bytes:rate:sum
expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
- record: instance:node_network_transmit_bytes:rate:sum
expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
- record: instance:node_cpu:ratio
expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
- record: cluster:node_cpu:sum_rate5m
expr: sum(rate(node_cpu{mode!="idle"}[5m]))
- record: cluster:node_cpu:ratio
expr: cluster:node_cpu:sum_rate5m / count(sum(node_cpu) BY (instance, cpu))
- alert: NodeExporterDown
expr: absent(up{job="node-exporter"} == 1)
for: 10m
labels:
severity: warning
annotations:
description: Prometheus could not scrape a node-exporter for more than 10m,
or node-exporters have disappeared from discovery
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
for: 30m
labels:
severity: warning
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is predicted
to run full within the next 24 hours (mounted at {{$labels.mountpoint}})
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
for: 10m
labels:
severity: critical
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is predicted
to run full within the next 2 hours (mounted at {{$labels.mountpoint}})
prometheus.rules.yaml: |
groups:
- name: prometheus.rules
rules:
- alert: PrometheusConfigReloadFailed
expr: prometheus_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
- alert: PrometheusNotificationQueueRunningFull
expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
for: 10m
labels:
severity: warning
annotations:
description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{
$labels.pod}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.01
for: 10m
labels:
severity: warning
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.alertmanager}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.03
for: 10m
labels:
severity: critical
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.alertmanager}}
- alert: PrometheusNotConnectedToAlertmanagers
expr: prometheus_notifications_alertmanagers_discovered < 1
for: 10m
labels:
severity: warning
annotations:
description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
to any Alertmanagers
- alert: PrometheusTSDBReloadsFailing
expr: increase(prometheus_tsdb_reloads_failures_total[2h]) > 0
for: 12h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
reload failures over the last two hours.'
summary: Prometheus has issues reloading data blocks from disk
- alert: PrometheusTSDBCompactionsFailing
expr: increase(prometheus_tsdb_compactions_failed_total[2h]) > 0
for: 12h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
compaction failures over the last two hours.'
summary: Prometheus has issues compacting sample blocks
- alert: PrometheusTSDBWALCorruptions
expr: tsdb_wal_corruptions_total > 0
for: 4h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead
log (WAL).'
summary: Prometheus write-ahead log is corrupted
- alert: PrometheusNotIngestingSamples
expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
for: 10m
labels:
severity: warning
annotations:
description: "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples."
summary: "Prometheus isn't ingesting samples"
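
The two `NodeDiskRunningFull` rules in `node.rules.yaml` above rely on `predict_linear`, which fits a simple linear regression to the samples in the range and extrapolates it. The Python sketch below only illustrates that arithmetic; it is not Prometheus's implementation, and the function name and the sample data are invented for the example.

```python
def predict_linear_sketch(samples, eval_time, horizon_seconds):
    """Least-squares line through (timestamp, value) samples,
    evaluated horizon_seconds after eval_time."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    # Shift timestamps so that eval_time becomes the origin.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon_seconds


# Hypothetical data: free space (in GiB) sampled hourly over 6 hours,
# shrinking by 5 GiB per hour from 100 GiB.
samples = [(hour * 3600, 100.0 - 5.0 * hour) for hour in range(7)]
now = 6 * 3600

# Mirrors `predict_linear(node_filesystem_free[6h], 3600 * 24) < 0`.
fires_24h = predict_linear_sketch(samples, now, 3600 * 24) < 0
# Same data with a 2 hour horizon, as in the critical rule's `3600 * 2`.
fires_2h = predict_linear_sketch(samples, now, 3600 * 2) < 0
```

With this made-up trend the 24 hour prediction drops below zero while the 2 hour prediction does not. The real critical rule also uses a shorter `[30m]` range, which this sketch does not model.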

@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring

@ -0,0 +1,17 @@
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
selector:
name: prometheus
phase: prod
ports:
- name: web
protocol: TCP
port: 80
targetPort: 9090

@ -0,0 +1,23 @@
The MIT License (MIT)
Copyright (c) 2017 Typhoon Authors
Copyright (c) 2017 Dalton Hubble
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

@ -0,0 +1,23 @@
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
* Minimal, stable base Kubernetes distribution
* Declarative infrastructure and configuration
* Free (freedom and cost) and privacy-respecting
* Practical for labs, datacenters, and clouds
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.10.2 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/)
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs
Please see the [official docs](https://typhoon.psdn.io) and the AWS [tutorial](https://typhoon.psdn.io/aws/).

@ -0,0 +1,19 @@
data "aws_ami" "coreos" {
most_recent = true
owners = ["595879546273"]
filter {
name = "architecture"
values = ["x86_64"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
filter {
name = "name"
values = ["CoreOS-${var.os_channel}-*"]
}
}

@ -0,0 +1,69 @@
# Network Load Balancer DNS Record
resource "aws_route53_record" "apiserver" {
zone_id = "${var.dns_zone_id}"
name = "${format("%s.%s.", var.cluster_name, var.dns_zone)}"
type = "A"
# AWS recommends their special "alias" records for ELBs
alias {
name = "${aws_lb.apiserver.dns_name}"
zone_id = "${aws_lb.apiserver.zone_id}"
evaluate_target_health = true
}
}
# Network Load Balancer for apiservers
resource "aws_lb" "apiserver" {
name = "${var.cluster_name}-apiserver"
load_balancer_type = "network"
internal = false
subnets = ["${aws_subnet.public.*.id}"]
enable_cross_zone_load_balancing = true
}
# Forward TCP traffic to controllers
resource "aws_lb_listener" "apiserver-https" {
load_balancer_arn = "${aws_lb.apiserver.arn}"
protocol = "TCP"
port = "443"
default_action {
type = "forward"
target_group_arn = "${aws_lb_target_group.controllers.arn}"
}
}
# Target group of controllers
resource "aws_lb_target_group" "controllers" {
name = "${var.cluster_name}-controllers"
vpc_id = "${aws_vpc.network.id}"
target_type = "instance"
protocol = "TCP"
port = 443
# TCP health check for apiserver
health_check {
protocol = "TCP"
port = 443
# NLBs require the healthy and unhealthy thresholds to be equal
healthy_threshold = 3
unhealthy_threshold = 3
# Interval between health checks must be 10 or 30 seconds
interval = 10
}
}
# Attach controller instances to apiserver NLB
resource "aws_lb_target_group_attachment" "controllers" {
count = "${var.controller_count}"
target_group_arn = "${aws_lb_target_group.controllers.arn}"
target_id = "${element(aws_instance.controllers.*.id, count.index)}"
port = 443
}

@ -0,0 +1,14 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=911f4115088b7511f29221f64bf8e93bfa9ee567"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["${aws_route53_record.etcds.*.fqdn}"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}

@ -1,6 +1,30 @@
---
systemd:
units:
- name: etcd-member.service
enable: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.4"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
Environment="ETCD_LISTEN_METRICS_URLS=http://0.0.0.0:2381"
Environment="ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt"
Environment="ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt"
Environment="ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key"
Environment="ETCD_CLIENT_CERT_AUTH=true"
Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt"
Environment="ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt"
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
- name: locksmithd.service
@ -18,43 +42,54 @@ systemd:
ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
[Install]
RequiredBy=kubelet.service
RequiredBy=etcd-member.service
- name: kubelet.service
enable: true
contents: |
[Unit]
Description=Kubelet via Hyperkube ACI
Description=Kubelet via Hyperkube
Wants=rpc-statd.service
[Service]
EnvironmentFile=/etc/kubernetes/kubelet.env
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/run/kubelet-pod.uuid \
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/cache/kubelet-pod.uuid \
--volume=resolv,kind=host,source=/etc/resolv.conf \
--mount volume=resolv,target=/etc/resolv.conf \
--volume var-lib-cni,kind=host,source=/var/lib/cni \
--mount volume=var-lib-cni,target=/var/lib/cni \
--volume var-lib-calico,kind=host,source=/var/lib/calico \
--mount volume=var-lib-calico,target=/var/lib/calico \
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/checkpoint-secrets
ExecStartPre=/bin/mkdir -p /etc/kubernetes/inactive-manifests
ExecStartPre=/bin/mkdir -p /var/lib/cni
ExecStartPre=/bin/mkdir -p /var/lib/calico
ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volumeplugins
ExecStartPre=/usr/bin/bash -c "grep 'certificate-authority-data' /etc/kubernetes/kubeconfig | awk '{print $2}' | base64 -d > /etc/kubernetes/ca.crt"
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/run/kubelet-pod.uuid
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/cache/kubelet-pod.uuid
ExecStart=/usr/lib/coreos/kubelet-wrapper \
--kubeconfig=/etc/kubernetes/kubeconfig \
--require-kubeconfig \
--client-ca-file=/etc/kubernetes/ca.crt \
--anonymous-auth=false \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--network-plugin=cni \
--lock-file=/var/run/lock/kubelet.lock \
--exit-on-lock-contention \
--pod-manifest-path=/etc/kubernetes/manifests \
--allow-privileged \
--node-labels=node-role.kubernetes.io/master \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--anonymous-auth=false \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_domain=cluster.local
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/run/kubelet-pod.uuid
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--kubeconfig=/etc/kubernetes/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=10
[Install]
@ -79,29 +114,14 @@ storage:
mode: 0644
contents:
inline: |
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
server: ${kubeconfig_server}
certificate-authority-data: ${kubeconfig_ca_cert}
users:
- name: kubelet
user:
client-certificate-data: ${kubeconfig_kubelet_cert}
client-key-data: ${kubeconfig_kubelet_key}
contexts:
- context:
cluster: local
user: kubelet
${kubeconfig}
- path: /etc/kubernetes/kubelet.env
filesystem: root
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.1_coreos.0
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.10.2
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -120,10 +140,9 @@ storage:
# Wrapper for bootkube start
set -e
# Move experimental manifests
[ -d /opt/bootkube/assets/experimental/manifests ] && mv /opt/bootkube/assets/experimental/manifests/* /opt/bootkube/assets/manifests && rm -r /opt/bootkube/assets/experimental/manifests
[ -d /opt/bootkube/assets/experimental/bootstrap-manifests ] && mv /opt/bootkube/assets/experimental/bootstrap-manifests/* /opt/bootkube/assets/bootstrap-manifests && rm -r /opt/bootkube/assets/experimental/bootstrap-manifests
[ -n "$(ls /opt/bootkube/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootkube/assets/manifests-*/* /opt/bootkube/assets/manifests && rm -rf /opt/bootkube/assets/manifests-*
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.6.0}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.12.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
@ -140,4 +159,4 @@ passwd:
users:
- name: core
ssh_authorized_keys:
- "${ssh_authorized_keys}"
- "${ssh_authorized_key}"
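
In the kubelet unit above, one `ExecStartPre` step pipes the kubeconfig through `grep`, `awk '{print $2}'` and `base64 -d` to write `/etc/kubernetes/ca.crt`. A minimal Python sketch of the same extraction follows, assuming the kubeconfig holds a `certificate-authority-data: <base64>` line; the sample kubeconfig and the placeholder certificate are hypothetical.

```python
import base64


def extract_ca_cert(kubeconfig_text: str) -> bytes:
    """Return the decoded CA bundle from the first matching kubeconfig line."""
    for line in kubeconfig_text.splitlines():
        if "certificate-authority-data" in line:  # the grep step
            fields = line.split()  # awk splits on whitespace
            if len(fields) >= 2:
                return base64.b64decode(fields[1])  # the base64 -d step
    raise ValueError("no certificate-authority-data entry found")


# Hypothetical kubeconfig fragment with a placeholder certificate body.
sample_pem = b"-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
sample_kubeconfig = (
    "clusters:\n"
    "- name: local\n"
    "  cluster:\n"
    "    certificate-authority-data: "
    + base64.b64encode(sample_pem).decode("ascii")
    + "\n"
)

ca_cert = extract_ca_cert(sample_kubeconfig)
```

Unlike the shell pipeline, which would decode every matching line, this sketch stops at the first match.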

@ -0,0 +1,82 @@
# Discrete DNS records for each controller's private IPv4 for etcd usage
resource "aws_route53_record" "etcds" {
count = "${var.controller_count}"
# DNS Zone where record should be created
zone_id = "${var.dns_zone_id}"
name = "${format("%s-etcd%d.%s.", var.cluster_name, count.index, var.dns_zone)}"
type = "A"
ttl = 300
# private IPv4 address for etcd
records = ["${element(aws_instance.controllers.*.private_ip, count.index)}"]
}
# Controller instances
resource "aws_instance" "controllers" {
count = "${var.controller_count}"
tags = {
Name = "${var.cluster_name}-controller-${count.index}"
}
instance_type = "${var.controller_type}"
ami = "${data.aws_ami.coreos.image_id}"
user_data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
# storage
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
}
# network
associate_public_ip_address = true
subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
vpc_security_group_ids = ["${aws_security_group.controller.id}"]
lifecycle {
ignore_changes = ["ami"]
}
}
# Controller Container Linux Config
data "template_file" "controller_config" {
count = "${var.controller_count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars = {
# Cannot use cyclic dependencies on controllers or their DNS records
etcd_name = "etcd${count.index}"
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com:2380,etcd1=https://cluster-etcd1.example.com:2380,...
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", null_resource.repeat.*.triggers.name, null_resource.repeat.*.triggers.domain))}"
kubeconfig = "${indent(10, module.bootkube.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
# Horrible hack to generate a Terraform list of a desired length without dependencies.
# Ideally: repeat("etcd", 3) -> ["etcd", "etcd", "etcd"]
resource "null_resource" "repeat" {
count = "${var.controller_count}"
triggers {
name = "etcd${count.index}"
domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
count = "${var.controller_count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
snippets = ["${var.controller_clc_snippets}"]
}
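
The `template_file` above derives two values with interpolation functions: `etcd_initial_cluster` (via `join` and `formatlist`) and `k8s_dns_service_ip` (via `cidrhost`). The Python sketch below mirrors those expressions to show the resulting strings. The cluster name, DNS zone and service CIDR are assumed example values, not values taken from this diff.

```python
import ipaddress


def etcd_initial_cluster(cluster_name: str, dns_zone: str, controller_count: int) -> str:
    """Mirror join(",", formatlist("%s=https://%s:2380", names, domains))."""
    names = [f"etcd{i}" for i in range(controller_count)]
    domains = [f"{cluster_name}-etcd{i}.{dns_zone}" for i in range(controller_count)]
    return ",".join(
        f"{name}=https://{domain}:2380" for name, domain in zip(names, domains)
    )


def cidrhost(prefix: str, hostnum: int) -> str:
    """Mirror Terraform's cidrhost: the hostnum-th address within the prefix."""
    return str(ipaddress.ip_network(prefix)[hostnum])


# Assumed example inputs (hypothetical cluster name, zone and service CIDR).
initial_cluster = etcd_initial_cluster("cluster", "example.com", 3)
dns_service_ip = cidrhost("10.3.0.0/16", 10)
```

Because the names and domains are built from `count.index` alone, the string can be computed without referring to the controller instances or their DNS records, which is how the template avoids the dependency cycle its comment mentions.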

@ -0,0 +1,57 @@
data "aws_availability_zones" "all" {}
# Network VPC, gateway, and routes
resource "aws_vpc" "network" {
cidr_block = "${var.host_cidr}"
assign_generated_ipv6_cidr_block = true
enable_dns_support = true
enable_dns_hostnames = true
tags = "${map("Name", "${var.cluster_name}")}"
}
resource "aws_internet_gateway" "gateway" {
vpc_id = "${aws_vpc.network.id}"
tags = "${map("Name", "${var.cluster_name}")}"
}
resource "aws_route_table" "default" {
vpc_id = "${aws_vpc.network.id}"
route {
cidr_block = "0.0.0.0/0"
gateway_id = "${aws_internet_gateway.gateway.id}"
}
route {
ipv6_cidr_block = "::/0"
gateway_id = "${aws_internet_gateway.gateway.id}"
}
tags = "${map("Name", "${var.cluster_name}")}"
}
# Subnets (one per availability zone)
resource "aws_subnet" "public" {
count = "${length(data.aws_availability_zones.all.names)}"
vpc_id = "${aws_vpc.network.id}"
availability_zone = "${data.aws_availability_zones.all.names[count.index]}"
cidr_block = "${cidrsubnet(var.host_cidr, 4, count.index)}"
ipv6_cidr_block = "${cidrsubnet(aws_vpc.network.ipv6_cidr_block, 8, count.index)}"
map_public_ip_on_launch = true
assign_ipv6_address_on_creation = true
tags = "${map("Name", "${var.cluster_name}-public-${count.index}")}"
}
resource "aws_route_table_association" "public" {
count = "${length(data.aws_availability_zones.all.names)}"
route_table_id = "${aws_route_table.default.id}"
subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
}
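
The `aws_subnet.public` resource above carves one subnet per availability zone out of the VPC ranges with `cidrsubnet(prefix, newbits, netnum)`. The Python sketch below reimplements that arithmetic for illustration. The `10.0.0.0/16` host CIDR and the documentation-range IPv6 `/56` block are assumed example inputs.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Extend the prefix by newbits and select subnet number netnum."""
    net = ipaddress.ip_network(prefix)
    new_prefixlen = net.prefixlen + newbits
    if new_prefixlen > net.max_prefixlen:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Place netnum in the bits between the old and the new prefix length.
    shift = net.max_prefixlen - new_prefixlen
    base = int(net.network_address) + (netnum << shift)
    # Reuse the input's class so IPv4 and IPv6 are never confused.
    return str(type(net)((base, new_prefixlen)))


# Assumed example inputs: a /16 IPv4 host CIDR and a /56 IPv6 VPC block.
ipv4_subnets = [cidrsubnet("10.0.0.0/16", 4, index) for index in range(3)]
ipv6_subnets = [cidrsubnet("2001:db8:1200::/56", 8, index) for index in range(3)]
```

Four new bits on a `/16` give sixteen possible `/20` subnets, and eight new bits on a `/56` give `/64` subnets, so this scheme leaves room for up to sixteen availability zones.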

@ -0,0 +1,25 @@
output "ingress_dns_name" {
value = "${module.workers.ingress_dns_name}"
description = "DNS name of the network load balancer for distributing traffic to Ingress controllers"
}
# Outputs for worker pools
output "vpc_id" {
value = "${aws_vpc.network.id}"
description = "ID of the VPC for creating worker instances"
}
output "subnet_ids" {
value = ["${aws_subnet.public.*.id}"]
description = "List of subnet IDs for creating worker instances"
}
output "worker_security_groups" {
value = ["${aws_security_group.worker.id}"]
description = "List of worker security group IDs"
}
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
}

@ -0,0 +1,25 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "aws" {
version = "~> 1.11"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}
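
The provider blocks above use Terraform's pessimistic constraint operator `~>`, which lets only the rightmost listed version component increase. A simplified Python sketch of that rule follows. It handles plain dotted numeric versions only and does not reproduce Terraform's own version parser.

```python
def parse_version(text: str) -> tuple:
    """Turn "1.11.0" into (1, 11, 0); numeric components only."""
    return tuple(int(part) for part in text.split("."))


def satisfies_pessimistic(constraint: str, version: str) -> bool:
    """Check a version against "~> constraint" (e.g. constraint "1.11")."""
    lower = parse_version(constraint)
    if len(lower) < 2:
        raise ValueError("this sketch expects at least MAJOR.MINOR")
    # Upper bound: bump the second-to-last component, drop the last one.
    upper = lower[:-2] + (lower[-2] + 1,)
    candidate = parse_version(version)
    return lower <= candidate and candidate[: len(upper)] < upper


# `version = "~> 1.11"` on the aws provider: at least 1.11, below 2.0.
aws_ok = satisfies_pessimistic("1.11", "1.14.0")
aws_too_old = satisfies_pessimistic("1.11", "1.10.0")
aws_next_major = satisfies_pessimistic("1.11", "2.0.0")
```

Read this way, `~> 1.11` accepts any 1.x release from 1.11 onward, and `~> 1.0` accepts any 1.x release.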

@ -0,0 +1,395 @@
# Security Groups (instance firewalls)
# Controller security group
resource "aws_security_group" "controller" {
name = "${var.cluster_name}-controller"
description = "${var.cluster_name} controller security group"
vpc_id = "${aws_vpc.network.id}"
tags = "${map("Name", "${var.cluster_name}-controller")}"
}
resource "aws_security_group_rule" "controller-icmp" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "icmp"
from_port = 0
to_port = 0
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "controller-ssh" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 22
to_port = 22
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "controller-apiserver" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "controller-etcd" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 2379
to_port = 2380
self = true
}
resource "aws_security_group_rule" "controller-etcd-metrics" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 2381
to_port = 2381
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-flannel" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-flannel-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
resource "aws_security_group_rule" "controller-node-exporter" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 9100
to_port = 9100
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-kubelet-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10250
to_port = 10250
self = true
}
resource "aws_security_group_rule" "controller-kubelet-read" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-kubelet-read-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "controller-bgp" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 179
to_port = 179
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-bgp-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 179
to_port = 179
self = true
}
resource "aws_security_group_rule" "controller-ipip" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = 4
from_port = 0
to_port = 0
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-ipip-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = 4
from_port = 0
to_port = 0
self = true
}
resource "aws_security_group_rule" "controller-ipip-legacy" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = 94
from_port = 0
to_port = 0
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-ipip-legacy-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = 94
from_port = 0
to_port = 0
self = true
}
resource "aws_security_group_rule" "controller-egress" {
security_group_id = "${aws_security_group.controller.id}"
type = "egress"
protocol = "-1"
from_port = 0
to_port = 0
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
# Worker security group
resource "aws_security_group" "worker" {
name = "${var.cluster_name}-worker"
description = "${var.cluster_name} worker security group"
vpc_id = "${aws_vpc.network.id}"
tags = "${map("Name", "${var.cluster_name}-worker")}"
}
resource "aws_security_group_rule" "worker-icmp" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "icmp"
from_port = 0
to_port = 0
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-ssh" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 22
to_port = 22
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-http" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 80
to_port = 80
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-https" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 443
to_port = 443
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-flannel" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-flannel-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
resource "aws_security_group_rule" "worker-node-exporter" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 9100
to_port = 9100
self = true
}
resource "aws_security_group_rule" "ingress-health" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10254
to_port = 10254
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-kubelet" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10250
to_port = 10250
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-kubelet-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10250
to_port = 10250
self = true
}
resource "aws_security_group_rule" "worker-kubelet-read" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-kubelet-read-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "worker-bgp" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 179
to_port = 179
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-bgp-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 179
to_port = 179
self = true
}
resource "aws_security_group_rule" "worker-ipip" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = 4
from_port = 0
to_port = 0
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-ipip-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = 4
from_port = 0
to_port = 0
self = true
}
resource "aws_security_group_rule" "worker-ipip-legacy" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = 94
from_port = 0
to_port = 0
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-ipip-legacy-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = 94
from_port = 0
to_port = 0
self = true
}
resource "aws_security_group_rule" "worker-egress" {
security_group_id = "${aws_security_group.worker.id}"
type = "egress"
protocol = "-1"
from_port = 0
to_port = 0
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}

@ -0,0 +1,91 @@
# Secure copy etcd TLS assets to controllers.
resource "null_resource" "copy-controller-secrets" {
count = "${var.controller_count}"
connection {
type = "ssh"
host = "${element(aws_instance.controllers.*.public_ip, count.index)}"
user = "core"
timeout = "15m"
}
provisioner "file" {
content = "${module.bootkube.etcd_ca_cert}"
destination = "$HOME/etcd-client-ca.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_cert}"
destination = "$HOME/etcd-client.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_key}"
destination = "$HOME/etcd-client.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_cert}"
destination = "$HOME/etcd-server.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_key}"
destination = "$HOME/etcd-server.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_cert}"
destination = "$HOME/etcd-peer.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_key}"
destination = "$HOME/etcd-peer.key"
}
provisioner "remote-exec" {
inline = [
"sudo mkdir -p /etc/ssl/etcd/etcd",
"sudo mv etcd-client* /etc/ssl/etcd/",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/server-ca.crt",
"sudo mv etcd-server.crt /etc/ssl/etcd/etcd/server.crt",
"sudo mv etcd-server.key /etc/ssl/etcd/etcd/server.key",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/peer-ca.crt",
"sudo mv etcd-peer.crt /etc/ssl/etcd/etcd/peer.crt",
"sudo mv etcd-peer.key /etc/ssl/etcd/etcd/peer.key",
"sudo chown -R etcd:etcd /etc/ssl/etcd",
"sudo chmod -R 500 /etc/ssl/etcd",
]
}
}
# Secure copy bootkube assets to ONE controller and start bootkube to perform
# one-time self-hosted cluster bootstrapping.
resource "null_resource" "bootkube-start" {
depends_on = [
"module.bootkube",
"module.workers",
"aws_route53_record.apiserver",
"null_resource.copy-controller-secrets",
]
connection {
type = "ssh"
host = "${aws_instance.controllers.0.public_ip}"
user = "core"
timeout = "15m"
}
provisioner "file" {
source = "${var.asset_dir}"
destination = "$HOME/assets"
}
provisioner "remote-exec" {
inline = [
"sudo mv $HOME/assets /opt/bootkube",
"sudo systemctl start bootkube",
]
}
}

@ -0,0 +1,124 @@
variable "cluster_name" {
type = "string"
description = "Unique cluster name (prepended to dns_zone)"
}
# AWS
variable "dns_zone" {
type = "string"
description = "AWS Route53 DNS Zone (e.g. aws.example.com)"
}
variable "dns_zone_id" {
type = "string"
description = "AWS Route53 DNS Zone ID (e.g. Z3PAABBCFAKEC0)"
}
# instances
variable "controller_count" {
type = "string"
default = "1"
description = "Number of controllers (i.e. masters)"
}
variable "worker_count" {
type = "string"
default = "1"
description = "Number of workers"
}
variable "controller_type" {
type = "string"
default = "t2.small"
description = "EC2 instance type for controllers"
}
variable "worker_type" {
type = "string"
default = "t2.small"
description = "EC2 instance type for workers"
}
variable "os_channel" {
type = "string"
default = "stable"
description = "Container Linux AMI channel (stable, beta, alpha)"
}
variable "disk_size" {
type = "string"
default = "40"
description = "Size of the EBS volume in GB"
}
variable "disk_type" {
type = "string"
default = "gp2"
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "controller_clc_snippets" {
type = "list"
description = "Controller Container Linux Config snippets"
default = []
}
variable "worker_clc_snippets" {
type = "list"
description = "Worker Container Linux Config snippets"
default = []
}
# configuration
variable "ssh_authorized_key" {
type = "string"
description = "SSH public key for user 'core'"
}
variable "asset_dir" {
description = "Path to a directory where generated assets should be placed (contains secrets)"
type = "string"
}
variable "networking" {
description = "Choice of networking provider (calico or flannel)"
type = "string"
default = "calico"
}
variable "network_mtu" {
description = "CNI interface MTU (applies to calico only). Use 8981 if using instance types with Jumbo frames."
type = "string"
default = "1480"
}
variable "host_cidr" {
description = "CIDR IPv4 range to assign to EC2 nodes"
type = "string"
default = "10.0.0.0/16"
}
variable "pod_cidr" {
description = "CIDR IPv4 range to assign Kubernetes pods"
type = "string"
default = "10.2.0.0/16"
}
variable "service_cidr" {
description = <<EOD
CIDR IPv4 range to assign Kubernetes services.
The 1st IP will be reserved for kube-apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"
default = "10.3.0.0/16"
}
variable "cluster_domain_suffix" {
description = "Queries for domains with the suffix will be answered by kube-dns. Default is cluster.local (e.g. foo.default.svc.cluster.local)"
type = "string"
default = "cluster.local"
}

@ -0,0 +1,20 @@
module "workers" {
source = "workers"
name = "${var.cluster_name}"
# AWS
vpc_id = "${aws_vpc.network.id}"
subnet_ids = ["${aws_subnet.public.*.id}"]
security_groups = ["${aws_security_group.worker.id}"]
count = "${var.worker_count}"
instance_type = "${var.worker_type}"
os_channel = "${var.os_channel}"
disk_size = "${var.disk_size}"
# configuration
kubeconfig = "${module.bootkube.kubeconfig}"
ssh_authorized_key = "${var.ssh_authorized_key}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
clc_snippets = "${var.worker_clc_snippets}"
}

@ -0,0 +1,19 @@
data "aws_ami" "coreos" {
most_recent = true
owners = ["595879546273"]
filter {
name = "architecture"
values = ["x86_64"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
filter {
name = "name"
values = ["CoreOS-${var.os_channel}-*"]
}
}

@ -22,38 +22,45 @@ systemd:
enable: true
contents: |
[Unit]
Description=Kubelet via Hyperkube ACI
Description=Kubelet via Hyperkube
Wants=rpc-statd.service
[Service]
EnvironmentFile=/etc/kubernetes/kubelet.env
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/run/kubelet-pod.uuid \
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/cache/kubelet-pod.uuid \
--volume=resolv,kind=host,source=/etc/resolv.conf \
--mount volume=resolv,target=/etc/resolv.conf \
--volume var-lib-cni,kind=host,source=/var/lib/cni \
--mount volume=var-lib-cni,target=/var/lib/cni \
--volume var-lib-calico,kind=host,source=/var/lib/calico \
--mount volume=var-lib-calico,target=/var/lib/calico \
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/checkpoint-secrets
ExecStartPre=/bin/mkdir -p /etc/kubernetes/inactive-manifests
ExecStartPre=/bin/mkdir -p /var/lib/cni
ExecStartPre=/bin/mkdir -p /var/lib/calico
ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volumeplugins
ExecStartPre=/usr/bin/bash -c "grep 'certificate-authority-data' /etc/kubernetes/kubeconfig | awk '{print $2}' | base64 -d > /etc/kubernetes/ca.crt"
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/run/kubelet-pod.uuid
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/cache/kubelet-pod.uuid
ExecStart=/usr/lib/coreos/kubelet-wrapper \
--kubeconfig=/etc/kubernetes/kubeconfig \
--require-kubeconfig \
--client-ca-file=/etc/kubernetes/ca.crt \
--anonymous-auth=false \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--network-plugin=cni \
--lock-file=/var/run/lock/kubelet.lock \
--exit-on-lock-contention \
--pod-manifest-path=/etc/kubernetes/manifests \
--allow-privileged \
--node-labels=node-role.kubernetes.io/node \
--anonymous-auth=false \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_domain=cluster.local
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/run/kubelet-pod.uuid
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--kubeconfig=/etc/kubernetes/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=5
[Install]
@ -77,29 +84,14 @@ storage:
mode: 0644
contents:
inline: |
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
server: ${kubeconfig_server}
certificate-authority-data: ${kubeconfig_ca_cert}
users:
- name: kubelet
user:
client-certificate-data: ${kubeconfig_kubelet_cert}
client-key-data: ${kubeconfig_kubelet_key}
contexts:
- context:
cluster: local
user: kubelet
${kubeconfig}
- path: /etc/kubernetes/kubelet.env
filesystem: root
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.1_coreos.0
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.10.2
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -116,7 +108,8 @@ storage:
--trust-keys-from-https \
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
quay.io/coreos/hyperkube:v1.7.1_coreos.0 \
--insecure-options=image \
docker://k8s.gcr.io/hyperkube:v1.10.2 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

@ -0,0 +1,82 @@
# Network Load Balancer for Ingress
resource "aws_lb" "ingress" {
name = "${var.name}-ingress"
load_balancer_type = "network"
internal = false
subnets = ["${var.subnet_ids}"]
enable_cross_zone_load_balancing = true
}
# Forward HTTP traffic to workers
resource "aws_lb_listener" "ingress-http" {
load_balancer_arn = "${aws_lb.ingress.arn}"
protocol = "TCP"
port = 80
default_action {
type = "forward"
target_group_arn = "${aws_lb_target_group.workers-http.arn}"
}
}
# Forward HTTPS traffic to workers
resource "aws_lb_listener" "ingress-https" {
load_balancer_arn = "${aws_lb.ingress.arn}"
protocol = "TCP"
port = 443
default_action {
type = "forward"
target_group_arn = "${aws_lb_target_group.workers-https.arn}"
}
}
# Network Load Balancer target groups of instances
resource "aws_lb_target_group" "workers-http" {
name = "${var.name}-workers-http"
vpc_id = "${var.vpc_id}"
target_type = "instance"
protocol = "TCP"
port = 80
# HTTP health check for ingress
health_check {
protocol = "HTTP"
port = 10254
path = "/healthz"
# NLBs require the healthy and unhealthy thresholds to be equal
healthy_threshold = 3
unhealthy_threshold = 3
# NLB health check interval must be 10 or 30 seconds
interval = 10
}
}
resource "aws_lb_target_group" "workers-https" {
name = "${var.name}-workers-https"
vpc_id = "${var.vpc_id}"
target_type = "instance"
protocol = "TCP"
port = 443
# HTTP health check for ingress
health_check {
protocol = "HTTP"
port = 10254
path = "/healthz"
# NLBs require the healthy and unhealthy thresholds to be equal
healthy_threshold = 3
unhealthy_threshold = 3
# NLB health check interval must be 10 or 30 seconds
interval = 10
}
}

@ -0,0 +1,4 @@
output "ingress_dns_name" {
value = "${aws_lb.ingress.dns_name}"
description = "DNS name of the network load balancer for distributing traffic to Ingress controllers"
}

@ -0,0 +1,87 @@
variable "name" {
type = "string"
description = "Unique name for the worker pool"
}
# AWS
variable "vpc_id" {
type = "string"
description = "Must be set to `vpc_id` output by cluster"
}
variable "subnet_ids" {
type = "list"
description = "Must be set to `subnet_ids` output by cluster"
}
variable "security_groups" {
type = "list"
description = "Must be set to `worker_security_groups` output by cluster"
}
# instances
variable "count" {
type = "string"
default = "1"
description = "Number of instances"
}
variable "instance_type" {
type = "string"
default = "t2.small"
description = "EC2 instance type"
}
variable "os_channel" {
type = "string"
default = "stable"
description = "Container Linux AMI channel (stable, beta, alpha)"
}
variable "disk_size" {
type = "string"
default = "40"
description = "Size of the EBS volume in GB"
}
variable "disk_type" {
type = "string"
default = "gp2"
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "clc_snippets" {
type = "list"
description = "Container Linux Config snippets"
default = []
}
# configuration
variable "kubeconfig" {
type = "string"
description = "Must be set to `kubeconfig` output by cluster"
}
variable "ssh_authorized_key" {
type = "string"
description = "SSH public key for user 'core'"
}
variable "service_cidr" {
description = <<EOD
CIDR IPv4 range to assign Kubernetes services.
The 1st IP will be reserved for kube-apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"
default = "10.3.0.0/16"
}
variable "cluster_domain_suffix" {
description = "Queries for domains with the suffix will be answered by kube-dns. Default is cluster.local (e.g. foo.default.svc.cluster.local)"
type = "string"
default = "cluster.local"
}
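
The variable descriptions above say several inputs must come from outputs of the cluster module. A rough sketch of wiring an extra worker pool that way is below. The module names `example-cluster` and `example-pool` and the `source` path are hypothetical placeholders; the input and output names are the ones declared in this change, and the values are illustrative.

```tf
# Sketch only: module names and the source path are assumptions, not published paths.
module "example-pool" {
  source = "../path/to/workers" # hypothetical location of this workers module

  # Inputs that must be set from outputs of the cluster module
  vpc_id          = "${module.example-cluster.vpc_id}"
  subnet_ids      = "${module.example-cluster.subnet_ids}"
  security_groups = "${module.example-cluster.worker_security_groups}"
  kubeconfig      = "${module.example-cluster.kubeconfig}"

  # Pool-specific settings (illustrative values)
  name               = "example-pool"
  ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
  count              = 2
  instance_type      = "t2.medium"
}
```

Passing the cluster's VPC, subnets, and security group keeps the pool's instances on the same network and under the same firewall rules as the built-in workers.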

@ -0,0 +1,75 @@
# Workers AutoScaling Group
resource "aws_autoscaling_group" "workers" {
name = "${var.name}-worker ${aws_launch_configuration.worker.name}"
# count
desired_capacity = "${var.count}"
min_size = "${var.count}"
max_size = "${var.count + 2}"
default_cooldown = 30
health_check_grace_period = 30
# network
vpc_zone_identifier = ["${var.subnet_ids}"]
# template
launch_configuration = "${aws_launch_configuration.worker.name}"
# target groups to which instances should be added
target_group_arns = [
"${aws_lb_target_group.workers-http.id}",
"${aws_lb_target_group.workers-https.id}",
]
lifecycle {
# override the default destroy and replace update behavior
create_before_destroy = true
}
tags = [{
key = "Name"
value = "${var.name}-worker"
propagate_at_launch = true
}]
}
# Worker template
resource "aws_launch_configuration" "worker" {
image_id = "${data.aws_ami.coreos.image_id}"
instance_type = "${var.instance_type}"
user_data = "${data.ct_config.worker_ign.rendered}"
# storage
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
}
# network
security_groups = ["${var.security_groups}"]
lifecycle {
# override the default destroy and replace update behavior
create_before_destroy = true
ignore_changes = ["image_id"]
}
}
# Worker Container Linux Config
data "template_file" "worker_config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
kubeconfig = "${indent(10, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
data "ct_config" "worker_ign" {
content = "${data.template_file.worker_config.rendered}"
pretty_print = false
snippets = ["${var.clc_snippets}"]
}

@ -0,0 +1,23 @@
The MIT License (MIT)
Copyright (c) 2017 Typhoon Authors
Copyright (c) 2017 Dalton Hubble
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

@ -0,0 +1,23 @@
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
* Minimal, stable base Kubernetes distribution
* Declarative infrastructure and configuration
* Free (freedom and cost) and privacy-respecting
* Practical for labs, datacenters, and clouds
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.10.2 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/)
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs
Please see the [official docs](https://typhoon.psdn.io) and the AWS [tutorial](https://typhoon.psdn.io/aws/).
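
A minimal usage sketch, assuming the module is referenced from a local checkout. The module name `example` and the `source` path are placeholders, the variable names are the ones referenced by the Terraform files in this change, and all values are illustrative.

```tf
# Sketch only: the source path is an assumption, not a published module address.
module "example" {
  source = "../path/to/this/module" # hypothetical; point at wherever this module lives

  # AWS
  cluster_name = "example"
  dns_zone     = "aws.example.com"
  dns_zone_id  = "Z3PAABBCFAKEC0"

  # instances
  controller_count = 1
  controller_type  = "t2.small"

  # configuration
  ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
  asset_dir          = "/home/user/.secrets/clusters/example"
  networking         = "calico"
}
```

The module exposes `kubeconfig` and `ingress_dns_name` outputs, which can be read from the root configuration once the cluster is applied.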

@ -0,0 +1,19 @@
data "aws_ami" "fedora" {
most_recent = true
owners = ["125523088429"]
filter {
name = "architecture"
values = ["x86_64"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
filter {
name = "name"
values = ["Fedora-Atomic-27-20180419.0.x86_64-*-gp2-*"]
}
}

@ -0,0 +1,69 @@
# Network Load Balancer DNS Record
resource "aws_route53_record" "apiserver" {
zone_id = "${var.dns_zone_id}"
name = "${format("%s.%s.", var.cluster_name, var.dns_zone)}"
type = "A"
# AWS recommends their special "alias" records for ELBs
alias {
name = "${aws_lb.apiserver.dns_name}"
zone_id = "${aws_lb.apiserver.zone_id}"
evaluate_target_health = true
}
}
# Network Load Balancer for apiservers
resource "aws_lb" "apiserver" {
name = "${var.cluster_name}-apiserver"
load_balancer_type = "network"
internal = false
subnets = ["${aws_subnet.public.*.id}"]
enable_cross_zone_load_balancing = true
}
# Forward TCP traffic to controllers
resource "aws_lb_listener" "apiserver-https" {
load_balancer_arn = "${aws_lb.apiserver.arn}"
protocol = "TCP"
port = "443"
default_action {
type = "forward"
target_group_arn = "${aws_lb_target_group.controllers.arn}"
}
}
# Target group of controllers
resource "aws_lb_target_group" "controllers" {
name = "${var.cluster_name}-controllers"
vpc_id = "${aws_vpc.network.id}"
target_type = "instance"
protocol = "TCP"
port = 443
# TCP health check for apiserver
health_check {
protocol = "TCP"
port = 443
# NLBs require the healthy and unhealthy thresholds to be equal
healthy_threshold = 3
unhealthy_threshold = 3
# NLB health check interval must be 10 or 30 seconds
interval = 10
}
}
# Attach controller instances to apiserver NLB
resource "aws_lb_target_group_attachment" "controllers" {
count = "${var.controller_count}"
target_group_arn = "${aws_lb_target_group.controllers.arn}"
target_id = "${element(aws_instance.controllers.*.id, count.index)}"
port = 443
}

@ -0,0 +1,17 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=911f4115088b7511f29221f64bf8e93bfa9ee567"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["${aws_route53_record.etcds.*.fqdn}"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
# Fedora
trusted_certs_dir = "/etc/pki/tls/certs"
}

@ -0,0 +1,107 @@
#cloud-config
write_files:
- path: /etc/etcd/etcd.conf
content: |
ETCD_NAME=${etcd_name}
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380
ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
ETCD_LISTEN_METRICS_URLS=http://0.0.0.0:2381
ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}
ETCD_STRICT_RECONFIG_CHECK=true
ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt
ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt
ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key
ETCD_CLIENT_CERT_AUTH=true
ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt
ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key
ETCD_PEER_CLIENT_CERT_AUTH=true
- path: /etc/systemd/system/cloud-metadata.service
content: |
[Unit]
Description=Cloud metadata agent
[Service]
Type=oneshot
Environment=OUTPUT=/run/metadata/cloud
ExecStart=/usr/bin/mkdir -p /run/metadata
ExecStart=/usr/bin/bash -c 'echo "HOSTNAME_OVERRIDE=$(curl\
--url http://169.254.169.254/latest/meta-data/local-ipv4\
--retry 10)" > $${OUTPUT}'
[Install]
WantedBy=multi-user.target
- path: /etc/systemd/system/kubelet.service.d/10-typhoon.conf
content: |
[Unit]
Requires=cloud-metadata.service
After=cloud-metadata.service
Wants=rpc-statd.service
[Service]
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/checkpoint-secrets
ExecStartPre=/bin/mkdir -p /etc/kubernetes/inactive-manifests
ExecStartPre=/bin/mkdir -p /var/lib/cni
ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volumeplugins
ExecStartPre=/usr/bin/bash -c "grep 'certificate-authority-data' /etc/kubernetes/kubeconfig | awk '{print $2}' | base64 -d > /etc/kubernetes/ca.crt"
Restart=always
RestartSec=10
- path: /etc/kubernetes/kubelet.conf
content: |
ARGS="--allow-privileged \
--anonymous-auth=false \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--kubeconfig=/etc/kubernetes/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/kubernetes/kubeconfig
permissions: '0644'
content: |
${kubeconfig}
- path: /var/lib/bootkube/.keep
- path: /etc/NetworkManager/conf.d/typhoon.conf
content: |
[main]
plugins=keyfile
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*
- path: /etc/selinux/config
owner: root:root
permissions: '0644'
content: |
SELINUX=permissive
SELINUXTYPE=targeted
bootcmd:
- [setenforce, Permissive]
- [systemctl, disable, firewalld, --now]
# https://github.com/kubernetes/kubernetes/issues/60869
- [modprobe, ip_vs]
runcmd:
- [systemctl, daemon-reload]
- [systemctl, restart, NetworkManager]
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.4"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.10.2"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.12.0"
- [systemctl, start, --no-block, etcd.service]
- [systemctl, enable, cloud-metadata.service]
- [systemctl, start, --no-block, kubelet.service]
users:
- default
- name: fedora
gecos: Fedora Admin
sudo: ALL=(ALL) NOPASSWD:ALL
groups: wheel,adm,systemd-journal,docker
ssh-authorized-keys:
- "${ssh_authorized_key}"

@ -0,0 +1,75 @@
# Discrete DNS records for each controller's private IPv4 for etcd usage
resource "aws_route53_record" "etcds" {
count = "${var.controller_count}"
# DNS Zone where record should be created
zone_id = "${var.dns_zone_id}"
name = "${format("%s-etcd%d.%s.", var.cluster_name, count.index, var.dns_zone)}"
type = "A"
ttl = 300
# private IPv4 address for etcd
records = ["${element(aws_instance.controllers.*.private_ip, count.index)}"]
}
# Controller instances
resource "aws_instance" "controllers" {
count = "${var.controller_count}"
tags = {
Name = "${var.cluster_name}-controller-${count.index}"
}
instance_type = "${var.controller_type}"
ami = "${data.aws_ami.fedora.image_id}"
user_data = "${element(data.template_file.controller-cloudinit.*.rendered, count.index)}"
# storage
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
}
# network
associate_public_ip_address = true
subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
vpc_security_group_ids = ["${aws_security_group.controller.id}"]
lifecycle {
ignore_changes = ["ami"]
}
}
# Controller Cloud-Init
data "template_file" "controller-cloudinit" {
count = "${var.controller_count}"
template = "${file("${path.module}/cloudinit/controller.yaml.tmpl")}"
vars = {
# Cannot use cyclic dependencies on controllers or their DNS records
etcd_name = "etcd${count.index}"
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com:2380,etcd1=https://cluster-etcd1.example.com:2380,...
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", null_resource.repeat.*.triggers.name, null_resource.repeat.*.triggers.domain))}"
kubeconfig = "${indent(6, module.bootkube.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
# Horrible hack to generate a Terraform list of a desired length without dependencies.
# Ideal ${repeat("etcd", 3) -> ["etcd", "etcd", "etcd"]}
resource "null_resource" "repeat" {
count = "${var.controller_count}"
triggers {
name = "etcd${count.index}"
domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
}
}
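
To make the two rendered template values concrete, here is a small standalone sketch. The `locals` stand in for the `null_resource.repeat` triggers and use made-up names and domains; only `join`, `formatlist`, and `cidrhost` are taken from the configuration above.

```tf
# Sketch only: the names and domains are illustrative stand-ins for the
# null_resource.repeat triggers used by the controller template.
locals {
  names   = ["etcd0", "etcd1"]
  domains = ["example-etcd0.aws.example.com", "example-etcd1.aws.example.com"]
}

# Renders:
# etcd0=https://example-etcd0.aws.example.com:2380,etcd1=https://example-etcd1.aws.example.com:2380
output "etcd_initial_cluster" {
  value = "${join(",", formatlist("%s=https://%s:2380", local.names, local.domains))}"
}

# The 10th address in the default service_cidr: 10.3.0.10
output "k8s_dns_service_ip" {
  value = "${cidrhost("10.3.0.0/16", 10)}"
}
```

Building the list from plain strings rather than from the instances or their DNS records is what avoids the cyclic dependency noted in the template comment.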

@ -0,0 +1,57 @@
data "aws_availability_zones" "all" {}
# Network VPC, gateway, and routes
resource "aws_vpc" "network" {
cidr_block = "${var.host_cidr}"
assign_generated_ipv6_cidr_block = true
enable_dns_support = true
enable_dns_hostnames = true
tags = "${map("Name", "${var.cluster_name}")}"
}
resource "aws_internet_gateway" "gateway" {
vpc_id = "${aws_vpc.network.id}"
tags = "${map("Name", "${var.cluster_name}")}"
}
resource "aws_route_table" "default" {
vpc_id = "${aws_vpc.network.id}"
route {
cidr_block = "0.0.0.0/0"
gateway_id = "${aws_internet_gateway.gateway.id}"
}
route {
ipv6_cidr_block = "::/0"
gateway_id = "${aws_internet_gateway.gateway.id}"
}
tags = "${map("Name", "${var.cluster_name}")}"
}
# Subnets (one per availability zone)
resource "aws_subnet" "public" {
count = "${length(data.aws_availability_zones.all.names)}"
vpc_id = "${aws_vpc.network.id}"
availability_zone = "${data.aws_availability_zones.all.names[count.index]}"
cidr_block = "${cidrsubnet(var.host_cidr, 4, count.index)}"
ipv6_cidr_block = "${cidrsubnet(aws_vpc.network.ipv6_cidr_block, 8, count.index)}"
map_public_ip_on_launch = true
assign_ipv6_address_on_creation = true
tags = "${map("Name", "${var.cluster_name}-public-${count.index}")}"
}
resource "aws_route_table_association" "public" {
count = "${length(data.aws_availability_zones.all.names)}"
route_table_id = "${aws_route_table.default.id}"
subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
}
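
As a worked example of the subnet arithmetic, and assuming the default `host_cidr` of `10.0.0.0/16`, `cidrsubnet` with 4 new bits carves the VPC range into /20 blocks, one per availability zone index. The output names are hypothetical and exist only for illustration.

```tf
# Sketch only: evaluates the same cidrsubnet call as aws_subnet.public,
# with count.index replaced by literal values.
output "subnet_az0" {
  value = "${cidrsubnet("10.0.0.0/16", 4, 0)}" # 10.0.0.0/20
}

output "subnet_az1" {
  value = "${cidrsubnet("10.0.0.0/16", 4, 1)}" # 10.0.16.0/20
}

output "subnet_az2" {
  value = "${cidrsubnet("10.0.0.0/16", 4, 2)}" # 10.0.32.0/20
}
```

Four extra bits allow at most 16 such subnets within the /16.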

@ -0,0 +1,25 @@
output "ingress_dns_name" {
value = "${module.workers.ingress_dns_name}"
description = "DNS name of the network load balancer for distributing traffic to Ingress controllers"
}
# Outputs for worker pools
output "vpc_id" {
value = "${aws_vpc.network.id}"
description = "ID of the VPC for creating worker instances"
}
output "subnet_ids" {
value = ["${aws_subnet.public.*.id}"]
description = "List of subnet IDs for creating worker instances"
}
output "worker_security_groups" {
value = ["${aws_security_group.worker.id}"]
description = "List of worker security group IDs"
}
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
}

@ -0,0 +1,25 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "aws" {
version = "~> 1.11"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}
