Compare commits

...

95 Commits

Author SHA1 Message Date
444363be2d Update Kubernetes from v1.19.1 to v1.19.2
* Update flannel from v0.12.0 to v0.13.0-rc2
* Update flannel-cni from v0.4.0 to v0.4.1
* Update CNI plugins from v0.8.6 to v0.8.7
2020-09-16 20:05:54 -07:00
bc7ad25c60 Update Grafana dashboard for Kubelet v1.19
* Fix Kubelet pod and container count metrics dashboard
* https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/499
2020-09-15 23:21:56 -07:00
e838d4dc3d Refresh Prometheus rules/alerts and Grafana dashboards
* Refresh upstream Prometheus rules/alerts and Grafana dashboards
2020-09-13 15:03:27 -07:00
979c092ef6 Reduce apiserver metrics cardinality of non-core APIs
* Reduce `apiserver_request_duration_seconds_count` cardinality
by dropping series for non-core Kubernetes APIs. This is done
to match `apiserver_request_duration_seconds_count` relabeling
* These two relabels must be performed the same way to avoid
affecting new SLO calculations (upcoming)
* See https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/498

Related: https://github.com/poseidon/typhoon/pull/596
2020-09-13 14:47:49 -07:00
db8e94bb4b Update recommended Terraform provider versions
* Sync Terraform provider plugins with those used internally
2020-09-12 19:41:15 -07:00
eb093af9ed Drop Kubelet labelmap relabel for node_name
* Originally, Kubelet and CAdvisor metrics used a labelmap
relabel to add Kubernetes SD node labels onto timeseries
* With https://github.com/poseidon/typhoon/pull/596 that
relabel was dropped since node labels aren't usually that
valuable. `__meta_kubernetes_node_name` was retained but
the field name is empty
* Favor just using Prometheus server-side `instance` in
queries that require some node identifier for aggregation
or debugging

Fix https://github.com/poseidon/typhoon/issues/823
2020-09-12 19:40:00 -07:00
36096f844d Promote Cilium from experimental to GA
* Cilium was added as an experimental CNI provider in June
* Since then, I've been choosing it for an increasing number
of clusters and scenarios.
2020-09-12 19:24:55 -07:00
d236628e53 Update Prometheus from v2.20.0 to v2.21.0
* https://github.com/prometheus/prometheus/releases/tag/v2.21.0
2020-09-12 19:20:54 -07:00
577b927a2b Update Fedora CoreOS Config version from v1.0.0 to v1.1.0
* No notable changes in the config spec, just house keeping
* Require any snippets customization to update to v1.1.0. Version
skew between the main config and snippets will show an err message
* https://github.com/coreos/fcct/blob/master/docs/configuration-v1_1.md
2020-09-10 23:38:40 -07:00
000c11edf6 Update IngressClass resources to networking.k8s.io/v1
* Kubernetes v1.19 graduated Ingress and IngressClass from
networking.k8s.io/v1beta1 to networking.k8s.io/v1
2020-09-10 23:25:53 -07:00
29b16c3fc0 Change seccomp annotations to seccompProfile
* seccomp graduated to GA in Kubernetes v1.19. Support for
seccomp alpha annotations will be removed in v1.22
* Replace seccomp annotations with the GA seccompProfile
field in the PodTemplate securityContext
* Switch profile from `docker/default` to `runtime/default`
(no effective change, since docker is the runtime)
* Verify with docker inspect SecurityOpt. Without the profile,
you'd see `seccomp=unconfined`

Related: https://github.com/poseidon/terraform-render-bootstrap/pull/215
2020-09-10 01:15:07 -07:00
0c7a879bc4 Update Kubernetes from v1.19.0 to v1.19.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#v1191
2020-09-09 20:52:29 -07:00
1e654c9e4e Update recommended Terraform provider versions
* Sync Terraform provider plugins with those used internally
* Update mkdocs-material from v5.5.11 to v5.5.12
2020-09-07 21:18:47 -07:00
28ee693e6b Update Cilium from v1.8.2 to v1.8.3
* https://github.com/cilium/cilium/releases/tag/v1.8.3
2020-09-07 21:10:27 -07:00
8c7d95aefd Update mkdocs-material from v5.5.9 to v5.5.11 2020-08-29 13:52:16 -07:00
d45dfdbf91 Update nginx-ingress from v0.34.1 to v0.35.0
* Repo changed to k8s.gcr.io/ingress-nginx/controller
* https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.35.0
2020-08-29 13:38:28 -07:00
d7e0536838 Add code group blocks to improve worker pool docs
* Show Fedora CoreOS and Flatcar Linux examples in
separate tabs, rather than trying to show one
* Add copyright footer for the poseidon org
2020-08-28 00:25:12 -07:00
8dd221a57c Add fleetlock docs and links to addons
* Add links to fleetlock for Fedora CoreOS reboot coordination
* https://github.com/poseidon/fleetlock
2020-08-28 00:02:24 -07:00
f17bb4cf61 Update mkdocs-material from v5.5.6 to v5.5.9 2020-08-27 09:20:18 -07:00
44f1fe620a Update recommended Terraform provider versions
* Sync Terraform provider plugins with those used internally
2020-08-27 09:18:39 -07:00
a504264e24 Update Grafana from v7.1.4 to v7.1.5
* https://github.com/grafana/grafana/releases/tag/v7.1.5
2020-08-27 08:52:07 -07:00
88cf7273dc Update Kubernetes from v1.18.8 to v1.19.0
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md
2020-08-27 08:50:01 -07:00
58def65a09 Update Grafana from v7.1.3 to v7.1.4
* https://github.com/grafana/grafana/releases/tag/v7.1.4
2020-08-22 15:40:09 -07:00
cd7fd29194 Update etcd from v3.4.10 to v3.4.12
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md
2020-08-19 21:25:41 -07:00
aafa38476a Fix SELinux race condition on non-bootstrap controllers in multi-controller (#808)
* Fix race condition for bootstrap-secrets SELinux context on non-bootstrap controllers in multi-controller FCOS clusters
* On first boot from disk on non-bootstrap controllers, adding bootstrap-secrets races with kubelet.service starting, which can cause the secrets assets to have the wrong label until kubelet.service restarts (service, reboot, auto-update)
* This can manifest as `kube-apiserver`, `kube-controller-manager`, and `kube-scheduler` pods crashlooping on spare controllers on first cluster creation
2020-08-19 21:18:10 -07:00
9a07f1d30b Update recommended Terraform provider versions
* Sync Terraform provider plugin versions to those used
internally
* Update mkdocs-material from v5.5.1 to v5.5.6
* Fix minor details in docs
2020-08-14 10:05:52 -07:00
c87db3ef37 Update Kubernetes from v1.18.6 to v1.18.8
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1188
2020-08-13 20:47:43 -07:00
342380cfa4 Update Terraform migration guide SHA
* Mention the first master branch SHA that introduced Terraform
v0.13 forward compatibility
* Link the migration guide on Github until a release is available
and website docs are published
2020-08-13 00:36:47 -07:00
5e70d7e2c8 Migrate from Terraform v0.12.x to v0.13.x
* Recommend Terraform v0.13.x
* Support automatic install of poseidon's provider plugins
* Update tutorial docs for Terraform v0.13.x
* Add migration guide for Terraform v0.13.x (best-effort)
* Require Terraform v0.12.26+ (migration compatibility)
* Require `terraform-provider-ct` v0.6.1
* Require `terraform-provider-matchbox` v0.4.1
* Require `terraform-provider-digitalocean` v1.20+

Related:

* https://www.hashicorp.com/blog/announcing-hashicorp-terraform-0-13/
* https://www.terraform.io/upgrade-guides/0-13.html
* https://registry.terraform.io/providers/poseidon/ct/latest
* https://registry.terraform.io/providers/poseidon/matchbox/latest
2020-08-12 01:54:32 -07:00
aab071309f Update recommended Terraform provider versions
* Sync Terraform provider plugin versions to those used
internally
2020-08-09 12:40:22 -07:00
f6ce12766b Allow terraform-provider-aws v3.0+ plugin
* Typhoon AWS is compatible with terraform-provider-aws v3.x releases
* Continue to allow v2.23+, no v3.x specific features are used
* Set required provider versions in the worker module, since
it can be used independently

Related:

* https://github.com/terraform-providers/terraform-provider-aws/releases/tag/v3.0.0
2020-08-09 12:39:26 -07:00
e1d6ab2f24 Update Grafana from v7.1.1 to v7.1.3
* https://github.com/grafana/grafana/releases/tag/v7.1.3
* https://github.com/grafana/grafana/releases/tag/v7.1.2
2020-08-08 18:59:49 -07:00
8b3d41d6a0 Update mkdocs-material from v5.4.0 to v5.5.1 2020-08-02 15:22:10 -07:00
ccee5d3d89 Update from coreos/flannel-cni to poseidon/flannel-cni
* Update CNI plugins from v0.6.0 to v0.8.6 to fix several CVEs
* Update the base image to alpine:3.12
* Use `flannel-cni` as an init container and remove sleep
* https://github.com/poseidon/terraform-render-bootstrap/pull/205
* https://github.com/poseidon/flannel-cni
* https://quay.io/repository/poseidon/flannel-cni

Background

* Switch from github.com/coreos/flannel-cni v0.3.0 which was last
published by me in 2017 and is no longer accessible to me to maintain
or patch
* Port to the poseidon/flannel-cni rewrite, which releases v0.4.0
to continue the prior release numbering
2020-08-02 15:13:15 -07:00
8aefd4f082 Relex terraform-provider-matchbox version constraint
* Allow use of terraform-provider-matchbox v0.3+ (which
allows v0.3.0 <= version < v1.0) for any pre 1.0 release
* Before, the requirement was v0.3.0 <= version < v0.4.0
2020-08-02 01:09:28 -07:00
78e6409bd0 Fix flannel support on Fedora CoreOS
* Fedora CoreOS now ships systemd-udev's `default.link` while
Flannel relies on being able to pick its own MAC address for
the `flannel.1` link for tunneled traffic to reach cni0 on
the destination side, without being dropped
* This change first appeared in FCOS testing-devel 32.20200624.20.1
and is the behavior going forward in FCOS since it was added
to align FCOS network naming / configs with the rest of Fedora
and address issues related to the default being missing
* Flatcar Linux (and Container Linux) has a specific flannel.link
configuration builtin, so it was not affected
* https://github.com/coreos/fedora-coreos-tracker/issues/574#issuecomment-665487296

Note: Typhoon's recommended and default CNI provider is Calico,
unless `networking` is set to flannel directly.
2020-08-01 21:22:08 -07:00
2aef42d4f6 Update Prometheus from v2.19.2 to v2.20.0
* https://github.com/prometheus/prometheus/releases/tag/v2.20.0
2020-07-25 16:37:28 -07:00
b7d67757de Update Grafana from v7.1.0 to v7.1.1
* https://github.com/grafana/grafana/releases/tag/v7.1.1
2020-07-25 16:33:40 -07:00
26f5d2d753 Fix some links in docs (#788) 2020-07-25 16:32:08 -07:00
cd0a28904e Update Cilium from v1.8.1 to v1.8.2
* https://github.com/cilium/cilium/releases/tag/v1.8.2
2020-07-25 16:06:27 -07:00
618f8b30fd Update CoreDNS from v1.6.7 to v1.7.0
* https://coredns.io/2020/06/15/coredns-1.7.0-release/
* Update Grafana dashboard with revised metrics names
2020-07-25 15:51:31 -07:00
264d23a1b5 Declare etcd data directory permissions
* Set etcd data directory /var/lib/etcd permissions to 700
* On Flatcar Linux, /var/lib/etcd is pre-existing and Ignition
v2 doesn't overwrite the directory. Update the Container Linux
config, but add the manual chmod workaround to bootstrap for
Flatcar Linux users
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v3410-2020-07-16
* https://github.com/etcd-io/etcd/pull/11798
2020-07-25 15:48:27 -07:00
f96e91f225 Update etcd from v3.4.9 to v3.4.10
* https://github.com/etcd-io/etcd/releases/tag/v3.4.10
2020-07-18 14:08:22 -07:00
efd4a0319d Update Grafana from v7.0.6 to v7.1.0
* https://github.com/grafana/grafana/releases/tag/v7.1.0
2020-07-18 13:54:56 -07:00
6df6bf904a Show Cilium as a CNI provider option in docs
* Start to show Cilium as a CNI option
* https://github.com/cilium/cilium
2020-07-18 13:27:56 -07:00
5fba20d358 Update recommended Terraform provider versions
* Sync Terraform provider plugin versions with those
used internally
2020-07-18 13:19:25 -07:00
a8d3d3bb12 Update ingress-nginx from v0.33.0 to v0.34.1
* Switch to ingress-nginx controller images from us.grc.io (eu, asia
can also be used if desired)
* https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.34.1
* https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.34.0
2020-07-15 22:43:49 -07:00
9ea6d2c245 Update Kubernetes from v1.18.5 to v1.18.6
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1186
* https://github.com/poseidon/terraform-render-bootstrap/pull/201
2020-07-15 22:05:57 -07:00
507aac9b78 Update mkdocs-material from v5.3.3 to v5.4.0 2020-07-11 22:56:59 -07:00
dfd2a0ec23 Update Grafana from v7.0.5 to v7.0.6
* https://github.com/grafana/grafana/releases/tag/v7.0.6
2020-07-09 21:10:48 -07:00
e3bf7d8f9b Update Prometheus from v2.19.1 to v2.19.2
* https://github.com/prometheus/prometheus/releases/tag/v2.19.2
2020-07-09 21:08:55 -07:00
49050320ce Update Cilium from v1.8.0 to v1.8.1
* https://github.com/cilium/cilium/releases/tag/v1.8.1
2020-07-05 16:00:00 -07:00
74e025c9e4 Update Grafana from v7.0.4 to v7.0.5
* https://github.com/grafana/grafana/releases/tag/v7.0.5
2020-07-05 15:49:34 -07:00
257a49ce37 Remove CoreOS Container Linux image names from docs
* Remove coreos-stable, coreos-beta, and coreos-alpha channel
references from docs
* CoreOS Container Linux is end of life (see changelog)
2020-06-30 01:36:53 -07:00
df3f40bcce Allow using Flatcar Linux edge on Azure
* Set Kubelet cgroup driver to systemd when Flatcar Linux edge
is chosen

Note: Typhoon module status assumes use of the stable variant of
an OS channel/stream. Its possible to use earlier variants and
those are sometimes tested or developed against, but stable is
the recommendation
2020-06-30 01:35:29 -07:00
32886cfba1 Promote Fedora CoreOS on Google Cloud to stable status 2020-06-29 23:09:11 -07:00
0ba2c1a4da Fix terraform fmt in firewall rules 2020-06-29 23:04:54 -07:00
430d139a5b Remove os_image variable on Google Cloud Fedora CoreOS
* In v1.18.3, the `os_stream` variable was added to select
a Fedora CoreOS image stream (stable, testing, next) on
AWS and Google Cloud (which publish official streams)
* Remove `os_image` variable deprecated in v1.18.3. Manually
uploaded images are no longer needed
2020-06-29 22:57:11 -07:00
7c6ab21b94 Isolate each DigitalOcean cluster in its own VPC
* DigitalOcean introduced Virtual Private Cloud (VPC) support
to match other clouds and enhance the prior "private networking"
feature. Before, droplet's belonging to different clusters (but
residing in the same region) could reach one another (although
Typhoon firewall rules prohibit this). Now, droplets in a VPC
reside in their own network
* https://www.digitalocean.com/docs/networking/vpc/
* Create droplet instances in a VPC per cluster. This matches the
design of Typhoon AWS, Azure, and GCP.
* Require `terraform-provider-digitalocean` v1.16.0+ (action required)
* Output `vpc_id` for use with an attached DigitalOcean
loadbalancer
2020-06-28 23:25:30 -07:00
21178868db Revert "Update Prometheus from v2.19.1 to v2.19.2"
* Prometheus has not published the v1.19.2
* This reverts commit 81b6f54169.
2020-06-27 14:53:58 -07:00
9dcf35e393 Update recommended Terraform provider versions
* Sync Terraform provider plugin versions with those
used internally
2020-06-27 14:44:18 -07:00
81b6f54169 Update Prometheus from v2.19.1 to v2.19.2
* https://github.com/prometheus/prometheus/releases/tag/v2.19.2
2020-06-27 14:34:30 -07:00
7bce15975c Update Kubernetes from v1.18.4 to v1.18.5
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1185
2020-06-27 13:52:18 -07:00
1f83ae7dbb Update Calico from v3.14.1 to v3.15.0
* https://docs.projectcalico.org/v3.15/release-notes/
2020-06-26 02:40:12 -07:00
a10a1cee9f Update mkdocs-material from v5.3.0 to v5.3.3 2020-06-26 02:24:37 -07:00
a79ad34ba3 Update Grafana from v7.0.3 to v7.0.4
* https://github.com/grafana/grafana/releases/tag/v7.0.4
2020-06-26 02:06:38 -07:00
99a11442c7 Update Prometheus from v2.19.0 to v2.19.1
* https://github.com/prometheus/prometheus/releases/tag/v2.19.1
2020-06-26 02:01:58 -07:00
d27f367004 Update Cilium from v1.8.0-rc4 to v1.8.0
* https://github.com/cilium/cilium/releases/tag/v1.8.0
2020-06-22 22:26:49 -07:00
e9c8520359 Add experimental Cilium CNI provider
* Accept experimental CNI `networking` mode "cilium"
* Run Cilium v1.8.0-rc4 with overlay vxlan tunnels and a
minimal set of features. We're interested in:
  * IPAM: Divide pod_cidr into /24 subnets per node
  * CNI networking pod-to-pod, pod-to-external
  * BPF masquerade
  * NetworkPolicy as defined by Kubernetes (no L7 Policy)
* Continue using kube-proxy with Cilium probe mode
* Firewall changes:
  * Require UDP 8472 for vxlan (Linux kernel default) between nodes
  * Optional ICMP echo(8) between nodes for host reachability
    (health)
  * Optional TCP 4240 between nodes for endpoint reachability (health)

Known Issues:

* Containers with `hostPort` don't listen on all host addresses,
these workloads must use `hostNetwork` for now
https://github.com/cilium/cilium/issues/12116
* Erroneous warning on Fedora CoreOS
https://github.com/cilium/cilium/issues/10256

Note: This is experimental. It is not listed in docs and may be
changed or removed without a deprecation notice

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/192
* https://github.com/cilium/cilium/issues/12217
2020-06-21 20:41:53 -07:00
37f00a3882 Reduce Calcio MTU on Fedora CoreOS Azure
* Change the Calico VXLAN interface for MTU from 1450 to 1410
* VXLAN on Azure should support MTU 1450. However, there is
history where performance measures have shown that 1410 is
needed to have expected performance. Flatcar Linux has the
same MTU 1410 override and note
* FCOS 31.20200323.3.2 was known to perform fine with 1450, but
now in 31.20200517.3.0 the right value seems to be 1410
2020-06-19 00:24:56 -07:00
4cfafeaa07 Fix Kubelet starting before hostname set on FCOS AWS
* Fedora CoreOS `kubelet.service` can start before the hostname
is set. Kubelet reads the hostname to determine the node name to
register. If the hostname was read as localhost, Kubelet will
continue trying to register as localhost (problem)
* This race manifests as a node that appears NotReady, the Kubelet
is trying to register as localhost, while the host itself (by then)
has an AWS provided hostname. Restarting kubelet.service is a
manual fix so Kubelet re-reads the hostname
* This race could only be shown on AWS, not on Google Cloud or
Azure despite attempts. Bare-metal and DigitalOcean differ and
use hostname-override (e.g. afterburn) so they're not affected
* Wait for nodes to have a non-localhost hostname in the oneshot
that awaits /etc/resolve.conf. Typhoon has no valid cases for a
node hostname being localhost (not even single-node clusters)

Related Openshift: https://github.com/openshift/machine-config-operator/pull/1813
Close https://github.com/poseidon/typhoon/issues/765
2020-06-19 00:19:54 -07:00
90e23f5822 Rename controller node label and NoSchedule taint
* Remove node label `node.kubernetes.io/master` from controller nodes
* Use `node.kubernetes.io/controller` (present since v1.9.5,
[#160](https://github.com/poseidon/typhoon/pull/160)) to node select controllers
* Rename controller NoSchedule taint from `node-role.kubernetes.io/master` to
`node-role.kubernetes.io/controller`
* Tolerate the new taint name for workloads that may run on controller nodes
and stop tolerating `node-role.kubernetes.io/master` taint
2020-06-19 00:12:13 -07:00
6234147948 Update recommended Terraform provider versions
* Sync Terraform provider plugin versions with those
used internally
2020-06-18 01:03:37 -07:00
c25c59058c Update Kubernetes from v1.18.3 to v1.18.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1184
2020-06-17 19:53:19 -07:00
bc9b808d44 Update nginx-ingress from v0.32.0 to v0.33.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/controller-0.33.0
2020-06-16 18:44:40 -07:00
4b0203fdb2 Fix typo in DigitalOcean docs title 2020-06-16 18:33:56 -07:00
331566e1f7 Update mkdocs packages for website 2020-06-16 18:20:19 -07:00
04520e447c Update node-exporter from v1.0.0 to v1.0.1
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.1
2020-06-16 17:57:09 -07:00
413585681b Remove unused Kubelet lock-file and exit-on-lock-contention
* Kubelet `--lock-file` and `--exit-on-lock-contention` date
back to usage of bootkube and at one point running Kubelet
in a "self-hosted" style whereby an on-host Kubelet (rkt)
started pods, but then a Kubelet DaemonSet was scheduled
and able to take over (hence self-hosted). `lock-file` and
`exit-on-lock-contention` flags supported this pivot. The
pattern has been out of favor (in bootkube too) for years
because of dueling Kubelet complexity
* Typhoon runs Kubelet as a container via an on-host systemd
unit using podman (Fedora CoreOS) or rkt (Flatcar Linux). In
fact, Typhoon no longer uses bootkube or control plane pivot
(let alone Kubelet pivot) and uses static pods since v1.16.0
* https://github.com/poseidon/typhoon/pull/536
2020-06-12 00:06:41 -07:00
96711d7f17 Remove unused Kubelet cert / key Terraform state
* Generated Kubelet TLS certificate and key are not longer
used or distributed to machines since Kubelet TLS bootstrap
is used instead. Remove the certificate and key from state
2020-06-11 21:24:36 -07:00
c9059d3fe9 Update Prometheus from v2.19.0-rc.0 to v2.19.0
* https://github.com/prometheus/prometheus/releases/tag/v2.19.0
2020-06-09 23:05:03 -07:00
a287920169 Use strict mode for Container Linux Configs
* Enable terraform-provider-ct `strict` mode for parsing
Container Linux Configs and snippets
* Fix Container Linux Config systemd unit syntax `enable`
(old) to `enabled`
* Align with Fedora CoreOS which uses strict mode already
2020-06-09 23:00:36 -07:00
8dc170b9d9 Update security disclosure contact email
* Use security@psdn.io across github.com/poseidon projects
2020-06-08 12:37:09 -07:00
aed1a5f33d Fix Fedora CoreOS docs for selecting a stream
* Fedora CoreOS image `os_stream` stable, testing, and next
have been configurable since v1.18.3
* Remove mention of outdated `os_image` variable
2020-06-08 12:25:57 -07:00
31d02b0221 Update Prometheus from v2.18.1 to v2.19.0-rc.0
* https://github.com/prometheus/prometheus/releases/tag/v2.19.0-rc.0
2020-06-05 00:16:45 -07:00
8f875f80f5 Update Grafana from v7.0.1 to v7.0.3
* https://github.com/grafana/grafana/releases/tag/v7.0.2
* https://github.com/grafana/grafana/releases/tag/v7.0.3
2020-06-03 12:31:58 -07:00
16c0b9152b Update kube-state-metrics from v1.9.6 to v1.9.7
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.7
2020-06-03 11:35:10 -07:00
99dbce67a3 Tweak minor style elements of issue templates 2020-05-31 16:19:33 -07:00
20bfd69780 Change Kubelet container image publishing
* Build Kubelet container images internally and publish
to Quay and Dockerhub (new) as an alternative in case of
registry outage or breach
* Use our infra to provide single and multi-arch (default)
Kublet images for possible future use
* Docs: Show how to use alternative Kubelet images via
snippets and a systemd dropin (builds on #737)

Changes:

* Update docs with changes to Kubelet image building
* If you prefer to trust images built by Quay/Dockerhub,
automated image builds are still available with unique
tags (albeit with some limitations):
  * Quay automated builds are tagged `build-{short_sha}`
  (limit: only amd64)
  * Dockerhub automated builts are tagged `build-{tag}`
  and `build-master` (limit: only amd64, no shas)

Links:

* Kubelet: https://github.com/poseidon/kubelet
* Docs: https://typhoon.psdn.io/topics/security/#container-images
* Registries:
  * quay.io/poseidon/kubelet
  * docker.io/psdn/kubelet
2020-05-30 23:34:23 -07:00
ba44408b76 Update Calico from v3.14.0 to v3.14.1
* https://docs.projectcalico.org/v3.14/release-notes/
2020-05-30 22:08:37 -07:00
455175d9e6 Update the fallback issue template
* Even "blank" issues need to fill out the fallback
template
2020-05-28 00:06:59 -07:00
d45804b1f6 Update Github issue template to use drop-downs (#747)
* Create a stricter bug report template
* Highlight topics that are not accepted in issues: operation, support, debugging, advice, or Kubernetes concepts
* Add a section to strongly suggest bug reports link a PR or describe a solution. This may be able to weed out topics that aren't focused bug reports
2020-05-27 23:49:25 -07:00
907a96916f Update mkdocs-material from v5.2.0 to v5.2.2
* https://github.com/squidfunk/mkdocs-material/releases/tag/5.2.2
2020-05-27 21:49:40 -07:00
187bb17d39 Update Grafana from v7.0.0 to v7.0.1
* https://github.com/grafana/grafana/releases/tag/v7.0.1
2020-05-27 21:35:24 -07:00
abc31c3711 Update node-exporter from v1.0.0-rc.1 to v1.0.0
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.0
2020-05-27 21:33:03 -07:00
140 changed files with 4745 additions and 1588 deletions

View File

@ -1,33 +0,0 @@
<!-- Fill in either the 'Bug' or 'Feature Request' section -->
## Bug
### Environment
* Platform: aws, azure, bare-metal, google-cloud, digital-ocean
* OS: fedora-coreos, flatcar-linux
* Release: Typhoon version or Git SHA (reporting latest is **not** helpful)
* Terraform: `terraform version` (reporting latest is **not** helpful)
* Plugins: Provider plugin versions (reporting latest is **not** helpful)
### Problem
Describe the problem.
### Desired Behavior
Describe the goal.
### Steps to Reproduce
Provide clear steps to reproduce the issue unless already covered.
## Feature Request
### Feature
Describe the feature and what problem it solves.
### Tradeoffs
What are the pros and cons of this feature? How will it be exercised and maintained?

39
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View File

@ -0,0 +1,39 @@
---
name: Bug report
about: Report a bug to improve the project
title: ''
labels: ''
assignees: ''
---
<!-- READ: Issues are used to receive focused bug reports from users and to track planned future enhancements by the authors. Topics like cluster operation, support, debugging help, advice, and Kubernetes concepts are out of scope and should not use issues-->
**Description**
A clear and concise description of what the bug is.
**Steps to Reproduce**
Provide clear steps to reproduce the bug.
- [ ] Relevant error messages if appropriate (concise, not a dump of everything).
- [ ] Explored using a vanilla cluster from the [tutorials](https://typhoon.psdn.io/#documentation). Ruled out [customizations](https://typhoon.psdn.io/advanced/customization/).
**Expected behavior**
A clear and concise description of what you expected to happen.
**Environment**
* Platform: aws, azure, bare-metal, google-cloud, digital-ocean
* OS: fedora-coreos, flatcar-linux (include release version)
* Release: Typhoon version or Git SHA (reporting latest is **not** helpful)
* Terraform: `terraform version` (reporting latest is **not** helpful)
* Plugins: Provider plugin versions (reporting latest is **not** helpful)
**Possible Solution**
<!-- Most bug reports should have some inkling about solutions. Otherwise, your report may be less of a bug and more of a support request (see top).-->
Link to a PR or description.

5
.github/ISSUE_TEMPLATE/config.yml vendored Normal file
View File

@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
- name: Security
url: https://typhoon.psdn.io/topics/security/
about: Report security vulnerabilities

15
.github/issue_template.md vendored Normal file
View File

@ -0,0 +1,15 @@
<!-- READ: Issues are used to receive focused bug reports from users and to track planned future enhancements by the authors. Topics like cluster operation, support, debugging help, advice, and Kubernetes concepts are out of scope and should not use issues-->
## Enhancement
### Overview
One paragraph explanation of the enhancement.
### Motivation
Describe the motivation and what problem this solves.
### Tradeoffs
What are the pros and cons of this feature? How will it be exercised and maintained?

View File

@ -2,8 +2,185 @@
Notable changes between versions.
## Latest
* Kubernetes [v1.19.2](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#v1192)
* Update flannel from v0.12.0 to v0.13.0-rc2 ([#216](https://github.com/poseidon/terraform-render-bootstrap/pull/216))
* Update flannel-cni from v0.4.0 to v0.4.1
* Update CNI plugins from v0.8.6 to v0.8.7
### Addons
* Refresh Prometheus rules/alerts and Grafana dashboards ([#831](https://github.com/poseidon/typhoon/pull/831))
* Reduce apiserver metrics cardinality for non-core APIs ([#830](https://github.com/poseidon/typhoon/pull/830))
## v1.19.1
* Kubernetes [v1.19.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#v1191)
* Change control plane seccomp annotations to GA `seccompProfile` ([#822](https://github.com/poseidon/typhoon/pull/822))
* Update Cilium from v1.8.2 to [v1.8.3](https://github.com/cilium/cilium/releases/tag/v1.8.3)
* Promote Cilium from experimental to general availability ([#827](https://github.com/poseidon/typhoon/pull/827))
* Update Calico from v1.15.2 to [v1.15.3](https://github.com/projectcalico/calico/releases/tag/v3.15.3)
### Fedora CoreOS
* Update Fedora CoreOS Config version from v1.0.0 to v1.1.0
* Require any [snippets](https://typhoon.psdn.io/advanced/customization/#hosts) customizations to update to v1.1.0
### Addons
* Update IngressClass resources to `networking.k8s.io/v1` ([#824](https://github.com/poseidon/typhoon/pull/824))
* Update Prometheus from v2.20.0 to [v2.21.0](https://github.com/prometheus/prometheus/releases/tag/v2.21.0)
* Remove Kubernetes node name labelmap `relabel_config` from etcd, Kubelet, and CAdvisor scrape config ([#828](https://github.com/poseidon/typhoon/pull/828))
## v1.19.0
* Kubernetes [v1.19.0](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#v1190)
* Update etcd from v3.4.10 to [v3.4.12](https://github.com/etcd-io/etcd/releases/tag/v3.4.12)
* Update Calico from v3.15.1 to [v3.15.2](https://docs.projectcalico.org/v3.15/release-notes/)
### Fedora CoreOS
* Fix race condition during bootstrap of multi-controller clusters ([#808](https://github.com/poseidon/typhoon/pull/808))
* Fix SELinux label of bootstrap-secrets on non-bootstrap controllers
### Addons
* Introduce [fleetlock](https://github.com/poseidon/fleetlock) for Fedora CoreOS reboot coordination ([#814](https://github.com/poseidon/typhoon/pull/814))
* Update nginx-ingress from v0.34.1 to [v0.35.0](https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.35.0)
* Repository changed to `k8s.gcr.io/ingress-nginx/controller`
* Update Grafana from v7.1.3 to [v7.1.5](https://github.com/grafana/grafana/releases/tag/v7.1.5)
## v1.18.8
* Kubernetes [v1.18.8](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1188)
* Migrate from Terraform v0.12.x to v0.13.x ([#804](https://github.com/poseidon/typhoon/pull/804)) (**action required**)
* Recommend Terraform v0.13.x ([migration guide](https://typhoon.psdn.io/topics/maintenance/#terraform-versions))
* Support automatic install of poseidon's provider plugins ([poseidon/ct](https://registry.terraform.io/providers/poseidon/ct/latest), [poseidon/matchbox](https://registry.terraform.io/providers/poseidon/matchbox/latest))
* Require Terraform v0.12.26+ (migration compatibility)
* Require `terraform-provider-ct` v0.6.1
* Require `terraform-provider-matchbox` v0.4.1
* Update etcd from v3.4.9 to [v3.4.10](https://github.com/etcd-io/etcd/releases/tag/v3.4.10)
* Update CoreDNS from v1.6.7 to [v1.7.0](https://coredns.io/2020/06/15/coredns-1.7.0-release/)
* Update Cilium from v1.8.1 to [v1.8.2](https://github.com/cilium/cilium/releases/tag/v1.8.2)
* Update [coreos/flannel-cni](https://github.com/coreos/flannel-cni) to [poseidon/flannel-cni](https://github.com/poseidon/flannel-cni) ([#798](https://github.com/poseidon/typhoon/pull/798))
* Update CNI plugins and fix CVEs with Flannel CNI (non-default)
* Transition to a poseidon maintained container image
### AWS
* Allow `terraform-provider-aws` v3.0+ ([#803](https://github.com/poseidon/typhoon/pull/803))
* Recommend updating `terraform-provider-aws` to v3.0+
* Continue to allow v2.23+, no v3.x specific features are used
### DigitalOcean
* Require `terraform-provider-digitalocean` v1.21+ for Terraform v0.13.x (unenforced)
* Require `terraform-provider-digitalocean` v1.20+ for Terraform v0.12.x
### Fedora CoreOS
* Fix support for Flannel with Fedora CoreOS ([#795](https://github.com/poseidon/typhoon/pull/795))
* Configure `flannel.1` link to select its own MAC address to solve flannel
pod-to-pod traffic drops starting with default link changes in Fedora CoreOS
32.20200629.3.0 ([details](https://github.com/coreos/fedora-coreos-tracker/issues/574#issuecomment-665487296))
#### Addons
* Update Prometheus from v2.19.2 to [v2.20.0](https://github.com/prometheus/prometheus/releases/tag/v2.20.0)
* Update Grafana from v7.0.6 to [v7.1.3](https://github.com/grafana/grafana/releases/tag/v7.1.3)
## v1.18.6
* Kubernetes [v1.18.6](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1186)
* Update Calico from v3.15.0 to [v3.15.1](https://docs.projectcalico.org/v3.15/release-notes/)
* Update Cilium from v1.8.0 to [v1.8.1](https://github.com/cilium/cilium/releases/tag/v1.8.1)
#### Addons
* Update nginx-ingress from v0.33.0 to [v0.34.1](https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.34.1)
* [ingress-nginx](https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.34.0) will publish images only to gcr.io
* Update Prometheus from v2.19.1 to [v2.19.2](https://github.com/prometheus/prometheus/releases/tag/v2.19.2)
* Update Grafana from v7.0.4 to [v7.0.6](https://github.com/grafana/grafana/releases/tag/v7.0.6)
## v1.18.5
* Kubernetes [v1.18.5](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1185)
* Add Cilium v1.8.0 as a (experimental) CNI provider option ([#760](https://github.com/poseidon/typhoon/pull/760))
* Set `networking` to "cilium" to enable
* Update Calico from v3.14.1 to [v3.15.0](https://docs.projectcalico.org/v3.15/release-notes/)
#### DigitalOcean
* Isolate each cluster in an independent DigitalOcean VPC ([#776](https://github.com/poseidon/typhoon/pull/776))
* Create droplets in a VPC per cluster (matches Typhoon AWS, Azure, and GCP)
* Require `terraform-provider-digitalocean` v1.16.0+ (action required)
* Output `vpc_id` for use with an attached DigitalOcean [loadbalancer](https://github.com/poseidon/typhoon/blob/v1.18.5/docs/architecture/digitalocean.md#custom-load-balancer)
### Fedora CoreOS
#### Google Cloud
* Promote Fedora CoreOS to stable
* Remove `os_image` variable deprecated in v1.18.3 ([#777](https://github.com/poseidon/typhoon/pull/777))
* Use `os_stream` to select a Fedora CoreOS image stream
### Flatcar Linux
#### Azure
* Allow using Flatcar Linux Edge by setting `os_image` to "flatcar-edge" ([#778](https://github.com/poseidon/typhoon/pull/778))
#### Addons
* Update Prometheus from v2.19.0 to [v2.19.1](https://github.com/prometheus/prometheus/releases/tag/v2.19.1)
* Update Grafana from v7.0.3 to [v7.0.4](https://github.com/grafana/grafana/releases/tag/v7.0.4)
## v1.18.4
* Kubernetes [v1.18.4](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1184)
* Update Kubelet image publishing ([#749](https://github.com/poseidon/typhoon/pull/749))
* Build Kubelet images internally and publish to Quay and Dockerhub
* [quay.io/poseidon/kubelet](https://quay.io/repository/poseidon/kubelet) (official)
* [docker.io/psdn/kubelet](https://hub.docker.com/r/psdn/kubelet) (fallback)
* Continue offering automated image builds with an alternate tag strategy (see [docs](https://typhoon.psdn.io/topics/security/#container-images))
* [Document](https://typhoon.psdn.io/advanced/customization/#kubelet) use of alternate Kubelet images during registry incidents
* Update Calico from v3.14.0 to [v3.14.1](https://docs.projectcalico.org/v3.14/release-notes/)
* Fix [CVE-2020-13597](https://github.com/kubernetes/kubernetes/issues/91507)
* Rename controller NoSchedule taint from `node-role.kubernetes.io/master` to `node-role.kubernetes.io/controller` ([#764](https://github.com/poseidon/typhoon/pull/764))
* Tolerate the new taint name for workloads that may run on controller nodes
* Remove node label `node.kubernetes.io/master` from controller nodes ([#764](https://github.com/poseidon/typhoon/pull/764))
* Use `node.kubernetes.io/controller` (present since v1.9.5, [#160](https://github.com/poseidon/typhoon/pull/160)) to node select controllers
* Remove unused Kubelet `-lock-file` and `-exit-on-lock-contention` ([#758](https://github.com/poseidon/typhoon/pull/758))
### Fedora CoreOS
#### Azure
* Use `strict` Fedora CoreOS Config (FCC) snippet parsing ([#755](https://github.com/poseidon/typhoon/pull/755))
* Reduce Calico vxlan interface MTU to maintain performance ([#767](https://github.com/poseidon/typhoon/pull/766))
#### AWS
* Fix Kubelet service race with hostname update ([#766](https://github.com/poseidon/typhoon/pull/766))
* Wait for a hostname to avoid Kubelet trying to register as `localhost`
### Flatcar Linux
* Use `strict` Container Linux Config (CLC) snippet parsing ([#755](https://github.com/poseidon/typhoon/pull/755))
* Require `terraform-provider-ct` v0.4+, recommend v0.5+ (**action required**)
### Addons
* Update nginx-ingress from v0.32.0 to [v0.33.0](https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.33.0)
* Update Prometheus from v2.18.1 to [v2.19.0](https://github.com/prometheus/prometheus/releases/tag/v2.19.0)
* Update node-exporter from v1.0.0-rc.1 to [v1.0.1](https://github.com/prometheus/node_exporter/releases/tag/v1.0.1)
* Update kube-state-metrics from v1.9.6 to v1.9.7
* Update Grafana from v7.0.0 to v7.0.3
## v1.18.3
* Kubernetes [v1.18.3](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1183)
* Use Kubelet [TLS bootstrap](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/) with bootstrap token authentication ([#713](https://github.com/poseidon/typhoon/pull/713))
* Enable Node [Authorization](https://kubernetes.io/docs/reference/access-authn-authz/node/) and [NodeRestriction](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction) to reduce authorization scope
* Renew Kubelet certificates every 72 hours
@ -43,7 +220,7 @@ Notable changes between versions.
#### Google
* Support Fedora CoreOS [image streams](https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/) ([#723](https://github.com/poseidon/typhoon/pull/722))
* Support Fedora CoreOS [image streams](https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/) ([#723](https://github.com/poseidon/typhoon/pull/723))
* Add `os_stream` variable to set the stream to `stable` (default), `testing`, or `next`
* Deprecate `os_image` variable. Manual image uploads are no longer needed

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/), SELinux enforcing
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [preemptible](https://typhoon.psdn.io/cl/google-cloud/#preemption) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, or other [addons](https://typhoon.psdn.io/addons/overview/)
@ -29,7 +29,7 @@ Typhoon is available for [Fedora CoreOS](https://getfedora.org/coreos/).
| Azure | Fedora CoreOS | [azure/fedora-coreos/kubernetes](azure/fedora-coreos/kubernetes) | alpha |
| Bare-Metal | Fedora CoreOS | [bare-metal/fedora-coreos/kubernetes](bare-metal/fedora-coreos/kubernetes) | beta |
| DigitalOcean | Fedora CoreOS | [digital-ocean/fedora-coreos/kubernetes](digital-ocean/fedora-coreos/kubernetes) | beta |
| Google Cloud | Fedora CoreOS | [google-cloud/fedora-coreos/kubernetes](google-cloud/fedora-coreos/kubernetes) | beta |
| Google Cloud | Fedora CoreOS | [google-cloud/fedora-coreos/kubernetes](google-cloud/fedora-coreos/kubernetes) | stable |
Typhoon is available for [Flatcar Linux](https://www.flatcar-linux.org/releases/).
@ -54,7 +54,7 @@ Define a Kubernetes cluster by using the Terraform module for your chosen platfo
```tf
module "yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/fedora-coreos/kubernetes?ref=v1.18.3"
source = "git::https://github.com/poseidon/typhoon//google-cloud/fedora-coreos/kubernetes?ref=v1.19.2"
# Google Cloud
cluster_name = "yavin"
@ -93,9 +93,9 @@ In 4-8 minutes (varies by platform), the cluster will be ready. This Google Clou
$ export KUBECONFIG=/home/user/.kube/configs/yavin-config
$ kubectl get nodes
NAME ROLES STATUS AGE VERSION
yavin-controller-0.c.example-com.internal <none> Ready 6m v1.18.3
yavin-worker-jrbf.c.example-com.internal <none> Ready 5m v1.18.3
yavin-worker-mzdm.c.example-com.internal <none> Ready 5m v1.18.3
yavin-controller-0.c.example-com.internal <none> Ready 6m v1.19.2
yavin-worker-jrbf.c.example-com.internal <none> Ready 5m v1.19.2
yavin-worker-mzdm.c.example-com.internal <none> Ready 5m v1.19.2
```
List the pods.

View File

@ -49,6 +49,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -72,7 +73,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(coredns_dns_request_count_total{instance=~\"$instance\"}[5m])) by (proto)",
"expr": "sum(rate(coredns_dns_requests_total{instance=~\"$instance\"}[5m])) by (proto)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{proto}}",
@ -140,6 +141,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -163,7 +165,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(coredns_dns_request_type_count_total{instance=~\"$instance\"}[5m])) by (type)",
"expr": "sum(rate(coredns_dns_requests_total{instance=~\"$instance\"}[5m])) by (type)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{type}}",
@ -231,6 +233,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -254,7 +257,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(coredns_dns_request_count_total{instance=~\"$instance\"}[5m])) by (zone)",
"expr": "sum(rate(coredns_dns_requests_total{instance=~\"$instance\"}[5m])) by (zone)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{zone}}",
@ -335,6 +338,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},
@ -440,6 +444,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},
@ -463,7 +468,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(coredns_dns_response_rcode_count_total{instance=~\"$instance\"}[5m])) by (rcode)",
"expr": "sum(rate(coredns_dns_responses_total{instance=~\"$instance\"}[5m])) by (rcode)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{rcode}}",
@ -544,6 +549,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},
@ -649,6 +655,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},
@ -767,6 +774,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},
@ -790,7 +798,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(coredns_cache_size{instance=~\"$instance\"}) by (type)",
"expr": "sum(coredns_cache_entries{instance=~\"$instance\"}) by (type)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{type}}",
@ -858,6 +866,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": false
},

View File

@ -11,7 +11,6 @@ data:
"editable": true,
"gnetId": null,
"hideControls": false,
"id": 6,
"links": [
],
@ -343,7 +342,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "etcd_debugging_mvcc_db_total_size_in_bytes{job=\"$cluster\"}",
"expr": "etcd_mvcc_db_total_size_in_bytes{job=\"$cluster\"}",
"hide": false,
"interval": "",
"intervalFactor": 2,

View File

@ -172,7 +172,7 @@ data:
"tableColumn": "",
"targets": [
{
"expr": "sum(kubelet_running_pod_count{cluster=\"$cluster\", job=\"kubelet\", instance=~\"$instance\"})",
"expr": "sum(kubelet_running_pods{cluster=\"$cluster\", job=\"kubelet\", instance=~\"$instance\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}}",
@ -256,7 +256,7 @@ data:
"tableColumn": "",
"targets": [
{
"expr": "sum(kubelet_running_container_count{cluster=\"$cluster\", job=\"kubelet\", instance=~\"$instance\"})",
"expr": "sum(kubelet_running_containers{cluster=\"$cluster\", job=\"kubelet\", instance=~\"$instance\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}}",
@ -565,6 +565,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -656,6 +657,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -760,6 +762,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -864,6 +867,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -962,6 +966,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1075,6 +1080,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1168,6 +1174,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1274,6 +1281,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1378,6 +1386,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1469,6 +1478,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1574,6 +1584,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1665,6 +1676,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1769,6 +1781,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1873,6 +1886,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1998,6 +2012,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -2021,7 +2036,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{cluster=\"$cluster\",job=\"kubelet\", instance=~\"$instance\"}[5m])) by (instance, verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{cluster=\"$cluster\",job=\"kubelet\", instance=~\"$instance\"}[5m])) by (instance, verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} {{verb}} {{url}}",
@ -2102,6 +2117,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2193,6 +2209,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2284,6 +2301,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2470,7 +2488,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Kubelet",
"uid": "3138fa155d5915769fbded898ac09fd9",
"version": 0
@ -2607,6 +2625,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2698,6 +2717,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -2802,6 +2822,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2893,6 +2914,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -2997,6 +3019,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3109,6 +3132,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3132,7 +3156,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-proxy\",instance=~\"$instance\",verb=\"POST\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-proxy\",instance=~\"$instance\",verb=\"POST\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -3213,6 +3237,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -3236,7 +3261,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-proxy\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-proxy\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -3317,6 +3342,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3408,6 +3434,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3499,6 +3526,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3659,7 +3687,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Proxy",
"uid": "632e265de029684c40b21cb76bca4f94",
"version": 0

View File

@ -31,6 +31,7 @@ data:
"fill": 1,
"format": "percentunit",
"id": 1,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -686,6 +687,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/85a562078cdf77779eaa1add43ccec1e/k8s-resources-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell_1",
"pattern": "Value #A",
@ -704,6 +706,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to workloads",
"linkUrl": "./d/a87fb0d919ec0ea5f6543124e16c42a5/k8s-resources-workloads-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell_1",
"pattern": "Value #B",
@ -722,6 +725,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -740,6 +744,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -758,6 +763,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -776,6 +782,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -794,6 +801,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #G",
@ -812,6 +820,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/85a562078cdf77779eaa1add43ccec1e/k8s-resources-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell",
"pattern": "namespace",
@ -839,7 +848,7 @@ data:
],
"targets": [
{
"expr": "count(mixin_pod_workload{cluster=\"$cluster\"}) by (namespace)",
"expr": "sum(kube_pod_owner{cluster=\"$cluster\"}) by (namespace)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -848,7 +857,7 @@ data:
"step": 10
},
{
"expr": "count(avg(mixin_pod_workload{cluster=\"$cluster\"}) by (workload, namespace)) by (namespace)",
"expr": "count(avg(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\"}) by (workload, namespace)) by (namespace)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -1105,6 +1114,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/85a562078cdf77779eaa1add43ccec1e/k8s-resources-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell_1",
"pattern": "Value #A",
@ -1123,6 +1133,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to workloads",
"linkUrl": "./d/a87fb0d919ec0ea5f6543124e16c42a5/k8s-resources-workloads-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell_1",
"pattern": "Value #B",
@ -1141,6 +1152,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -1159,6 +1171,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -1177,6 +1190,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -1195,6 +1209,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -1213,6 +1228,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #G",
@ -1231,6 +1247,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/85a562078cdf77779eaa1add43ccec1e/k8s-resources-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell",
"pattern": "namespace",
@ -1258,7 +1275,7 @@ data:
],
"targets": [
{
"expr": "count(mixin_pod_workload{cluster=\"$cluster\"}) by (namespace)",
"expr": "sum(kube_pod_owner{cluster=\"$cluster\"}) by (namespace)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -1267,7 +1284,7 @@ data:
"step": 10
},
{
"expr": "count(avg(mixin_pod_workload{cluster=\"$cluster\"}) by (workload, namespace)) by (namespace)",
"expr": "count(avg(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\"}) by (workload, namespace)) by (namespace)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -1384,6 +1401,7 @@ data:
"datasource": "$datasource",
"fill": 1,
"id": 11,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1426,6 +1444,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -1444,6 +1463,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -1462,6 +1482,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -1480,6 +1501,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -1498,6 +1520,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -1516,6 +1539,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -1534,6 +1558,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/85a562078cdf77779eaa1add43ccec1e/k8s-resources-namespace?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$__cell",
"pattern": "namespace",
@ -2472,33 +2497,6 @@ data:
"regex": "",
"type": "datasource"
},
{
"allValue": null,
"current": {
"text": "prod",
"value": "prod"
},
"datasource": "$datasource",
"hide": 2,
"includeAll": false,
"label": "cluster",
"multi": false,
"name": "cluster",
"options": [
],
"query": "label_values(node_cpu_seconds_total, cluster)",
"refresh": 1,
"regex": "",
"sort": 2,
"tagValuesQuery": "",
"tags": [
],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
@ -2557,7 +2555,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Cluster",
"uid": "efa86fd1d0c121a26444b636a3f509a8",
"version": 0
@ -2789,7 +2787,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\"}) / sum(kube_pod_container_resource_requests_memory_bytes{namespace=\"$namespace\"})",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\", image!=\"\"}) / sum(kube_pod_container_resource_requests_memory_bytes{namespace=\"$namespace\"})",
"format": "time_series",
"instant": true,
"intervalFactor": 2,
@ -2873,7 +2871,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\"}) / sum(kube_pod_container_resource_limits_memory_bytes{namespace=\"$namespace\"})",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\", image!=\"\"}) / sum(kube_pod_container_resource_limits_memory_bytes{namespace=\"$namespace\"})",
"format": "time_series",
"instant": true,
"intervalFactor": 2,
@ -3115,6 +3113,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -3133,6 +3132,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -3151,6 +3151,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -3169,6 +3170,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -3187,6 +3189,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -3205,6 +3208,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -3387,7 +3391,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}) by (pod)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}) by (pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3515,6 +3519,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -3533,6 +3538,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -3551,6 +3557,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -3569,6 +3576,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -3587,6 +3595,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -3605,6 +3614,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -3623,6 +3633,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #G",
@ -3641,6 +3652,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #H",
@ -3659,6 +3671,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -3686,7 +3699,7 @@ data:
],
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\"}) by (pod)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\", image!=\"\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -3704,7 +3717,7 @@ data:
"step": 10
},
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_requests_memory_bytes{namespace=\"$namespace\"}) by (pod)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\", image!=\"\"}) by (pod) / sum(kube_pod_container_resource_requests_memory_bytes{namespace=\"$namespace\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -3722,7 +3735,7 @@ data:
"step": 10
},
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_limits_memory_bytes{namespace=\"$namespace\"}) by (pod)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\",container!=\"\", image!=\"\"}) by (pod) / sum(kube_pod_container_resource_limits_memory_bytes{namespace=\"$namespace\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -3821,6 +3834,7 @@ data:
"datasource": "$datasource",
"fill": 1,
"id": 9,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -3863,6 +3877,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -3881,6 +3896,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -3899,6 +3915,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -3917,6 +3934,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -3935,6 +3953,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -3953,6 +3972,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -3971,6 +3991,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -4798,7 +4819,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Namespace (Pods)",
"uid": "85a562078cdf77779eaa1add43ccec1e",
"version": 0
@ -4861,7 +4882,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -4973,6 +4994,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -4991,6 +5013,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -5009,6 +5032,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -5027,6 +5051,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -5045,6 +5070,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -5063,6 +5089,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "pod",
@ -5090,7 +5117,7 @@ data:
],
"targets": [
{
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5099,7 +5126,7 @@ data:
"step": 10
},
{
"expr": "sum(kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5108,7 +5135,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=\"$node\"}) by (pod) / sum(kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=~\"$node\"}) by (pod) / sum(kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5117,7 +5144,7 @@ data:
"step": 10
},
{
"expr": "sum(kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5126,7 +5153,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=\"$node\"}) by (pod) / sum(kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", node=~\"$node\"}) by (pod) / sum(kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5226,7 +5253,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=\"$node\", container!=\"\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=~\"$node\", container!=\"\"}) by (pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -5338,6 +5365,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -5356,6 +5384,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -5374,6 +5403,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -5392,6 +5422,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -5410,6 +5441,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -5428,6 +5460,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -5446,6 +5479,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #G",
@ -5464,6 +5498,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #H",
@ -5482,6 +5517,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "pod",
@ -5509,7 +5545,7 @@ data:
],
"targets": [
{
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5518,7 +5554,7 @@ data:
"step": 10
},
{
"expr": "sum(kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5527,7 +5563,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_requests_memory_bytes{node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_requests_memory_bytes{node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5536,7 +5572,7 @@ data:
"step": 10
},
{
"expr": "sum(kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", node=\"$node\"}) by (pod)",
"expr": "sum(kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5545,7 +5581,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_limits_memory_bytes{node=\"$node\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_working_set_bytes{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod) / sum(kube_pod_container_resource_limits_memory_bytes{node=~\"$node\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5554,7 +5590,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_memory_rss{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_rss{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5563,7 +5599,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_memory_cache{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_cache{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5572,7 +5608,7 @@ data:
"step": 10
},
{
"expr": "sum(node_namespace_pod_container:container_memory_swap{cluster=\"$cluster\", node=\"$node\",container!=\"\"}) by (pod)",
"expr": "sum(node_namespace_pod_container:container_memory_swap{cluster=\"$cluster\", node=~\"$node\",container!=\"\"}) by (pod)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -5691,7 +5727,7 @@ data:
"hide": 0,
"includeAll": false,
"label": null,
"multi": false,
"multi": true,
"name": "node",
"options": [
@ -5739,7 +5775,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Node (Pods)",
"uid": "200ac8fdbfbb74b39aff88118e4d1c2c",
"version": 0

View File

@ -189,7 +189,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(increase(container_cpu_cfs_throttled_periods_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", cluster=\"$cluster\"}[5m])) by (container) /sum(increase(container_cpu_cfs_periods_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", cluster=\"$cluster\"}[5m])) by (container)",
"expr": "sum(increase(container_cpu_cfs_throttled_periods_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\", cluster=\"$cluster\"}[5m])) by (container) /sum(increase(container_cpu_cfs_periods_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\", cluster=\"$cluster\"}[5m])) by (container)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{container}}",
@ -203,7 +203,7 @@ data:
"fill": true,
"line": true,
"op": "gt",
"value": 1,
"value": 0.80000000000000004,
"yaxis": "left"
}
],
@ -308,6 +308,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -326,6 +327,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -344,6 +346,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -362,6 +365,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -380,6 +384,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -398,6 +403,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "container",
@ -580,7 +586,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\"}) by (container)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\", image!=\"\"}) by (container)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{container}}",
@ -708,6 +714,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -726,6 +733,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -744,6 +752,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -762,6 +771,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -780,6 +790,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -798,6 +809,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -816,6 +828,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #G",
@ -834,6 +847,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #H",
@ -852,6 +866,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "container",
@ -879,7 +894,7 @@ data:
],
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\"}) by (container)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\", image!=\"\"}) by (container)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -897,7 +912,7 @@ data:
"step": 10
},
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\"}) by (container) / sum(kube_pod_container_resource_requests_memory_bytes{namespace=\"$namespace\", pod=\"$pod\"}) by (container)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", image!=\"\"}) by (container) / sum(kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\"}) by (container)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -915,7 +930,7 @@ data:
"step": 10
},
{
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\"}) by (container) / sum(kube_pod_container_resource_limits_memory_bytes{namespace=\"$namespace\", pod=\"$pod\"}) by (container)",
"expr": "sum(container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\", image!=\"\"}) by (container) / sum(kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\"}) by (container)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -1014,6 +1029,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 6,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1112,6 +1128,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 7,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1210,6 +1227,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 8,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1308,6 +1326,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 9,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1406,6 +1425,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 10,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1504,6 +1524,7 @@ data:
"datasource": "$datasource",
"fill": 10,
"id": 11,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -1724,7 +1745,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Pod",
"uid": "6581e46e4e5c7ba40a07646395ef7b23",
"version": 0
@ -1787,7 +1808,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -1899,6 +1920,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -1917,6 +1939,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -1935,6 +1958,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -1953,6 +1977,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -1971,6 +1996,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -1989,6 +2015,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -2016,7 +2043,7 @@ data:
],
"targets": [
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2025,7 +2052,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2034,7 +2061,7 @@ data:
"step": 10
},
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2043,7 +2070,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2052,7 +2079,7 @@ data:
"step": 10
},
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2152,7 +2179,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -2264,6 +2291,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -2282,6 +2310,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -2300,6 +2329,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -2318,6 +2348,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -2336,6 +2367,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -2354,6 +2386,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -2381,7 +2414,7 @@ data:
],
"targets": [
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2390,7 +2423,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2399,7 +2432,7 @@ data:
"step": 10
},
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2408,7 +2441,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2417,7 +2450,7 @@ data:
"step": 10
},
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n/sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\", workload_type=\"$type\"}\n) by (pod)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2489,6 +2522,7 @@ data:
"datasource": "$datasource",
"fill": 1,
"id": 5,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -2531,6 +2565,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -2549,6 +2584,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -2567,6 +2603,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -2585,6 +2622,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -2603,6 +2641,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -2621,6 +2660,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -2639,6 +2679,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/6581e46e4e5c7ba40a07646395ef7b23/k8s-resources-pod?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-pod=$__cell",
"pattern": "pod",
@ -2666,7 +2707,7 @@ data:
],
"targets": [
{
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2675,7 +2716,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2684,7 +2725,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2693,7 +2734,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2702,7 +2743,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2711,7 +2752,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -2811,7 +2852,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -2909,7 +2950,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3007,7 +3048,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(avg(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(avg(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3105,7 +3146,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(avg(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(avg(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3203,7 +3244,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3301,7 +3342,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3399,7 +3440,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3497,7 +3538,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\"$workload\", workload_type=\"$type\"}) by (pod))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{pod}}",
@ -3646,7 +3687,7 @@ data:
"options": [
],
"query": "label_values(mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\"}, workload)",
"query": "label_values(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\"}, workload)",
"refresh": 1,
"regex": "",
"sort": 1,
@ -3673,7 +3714,7 @@ data:
"options": [
],
"query": "label_values(mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\"}, workload_type)",
"query": "label_values(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload=\"$workload\"}, workload_type)",
"refresh": 1,
"regex": "",
"sort": 1,
@ -3716,7 +3757,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Workload",
"uid": "a164a7f0339f99e89cea5cb47e9be617",
"version": 0
@ -3798,7 +3839,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}} - {{workload_type}}",
@ -3926,6 +3967,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -3944,6 +3986,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -3962,6 +4005,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -3980,6 +4024,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -3998,6 +4043,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -4016,6 +4062,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -4034,6 +4081,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/a164a7f0339f99e89cea5cb47e9be617/k8s-resources-workload?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-workload=$__cell&var-type=$__cell_2",
"pattern": "workload",
@ -4052,6 +4100,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "workload_type",
@ -4079,7 +4128,7 @@ data:
],
"targets": [
{
"expr": "count(mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}) by (workload, workload_type)",
"expr": "count(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}) by (workload, workload_type)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4088,7 +4137,7 @@ data:
"step": 10
},
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4097,7 +4146,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4106,7 +4155,7 @@ data:
"step": 10
},
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_requests_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4115,7 +4164,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4124,7 +4173,7 @@ data:
"step": 10
},
{
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_limits_cpu_cores{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4243,7 +4292,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}} - {{workload_type}}",
@ -4371,6 +4420,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -4389,6 +4439,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -4407,6 +4458,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -4425,6 +4477,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -4443,6 +4496,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -4461,6 +4515,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -4479,6 +4534,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "./d/a164a7f0339f99e89cea5cb47e9be617/k8s-resources-workload?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-workload=$__cell&var-type=$__cell_2",
"pattern": "workload",
@ -4497,6 +4553,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "workload_type",
@ -4524,7 +4581,7 @@ data:
],
"targets": [
{
"expr": "count(mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}) by (workload, workload_type)",
"expr": "count(namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}) by (workload, workload_type)",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4533,7 +4590,7 @@ data:
"step": 10
},
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4542,7 +4599,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4551,7 +4608,7 @@ data:
"step": 10
},
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_requests_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4560,7 +4617,7 @@ data:
"step": 10
},
{
"expr": "sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4569,7 +4626,7 @@ data:
"step": 10
},
{
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"expr": "sum(\n container_memory_working_set_bytes{cluster=\"$cluster\", namespace=\"$namespace\", container!=\"\", image!=\"\"}\n * on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n/sum(\n kube_pod_container_resource_limits_memory_bytes{cluster=\"$cluster\", namespace=\"$namespace\"}\n* on(namespace,pod)\n group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=\"$namespace\", workload_type=\"$type\"}\n) by (workload, workload_type)\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4641,6 +4698,7 @@ data:
"datasource": "$datasource",
"fill": 1,
"id": 5,
"interval": "1m",
"legend": {
"avg": false,
"current": false,
@ -4683,6 +4741,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -4701,6 +4760,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -4719,6 +4779,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #C",
@ -4737,6 +4798,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #D",
@ -4755,6 +4817,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #E",
@ -4773,6 +4836,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #F",
@ -4791,6 +4855,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "Drill down to pods",
"linkUrl": "./d/a164a7f0339f99e89cea5cb47e9be617/k8s-resources-workload?var-datasource=$datasource&var-cluster=$cluster&var-namespace=$namespace&var-workload=$__cell&var-type=$type",
"pattern": "workload",
@ -4809,6 +4874,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "workload_type",
@ -4836,7 +4902,7 @@ data:
],
"targets": [
{
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4845,7 +4911,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4854,7 +4920,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4863,7 +4929,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4872,7 +4938,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4881,7 +4947,7 @@ data:
"step": 10
},
{
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload_type=\"$type\"}) by (workload))\n",
"format": "table",
"instant": true,
"intervalFactor": 2,
@ -4981,7 +5047,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5079,7 +5145,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5177,7 +5243,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(avg(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(avg(irate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5275,7 +5341,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(avg(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(avg(irate(container_network_transmit_bytes_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5373,7 +5439,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5471,7 +5537,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_packets_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5569,7 +5635,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_receive_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5667,7 +5733,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod) \ngroup_left(workload,workload_type) mixin_pod_workload{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"expr": "(sum(irate(container_network_transmit_packets_dropped_total{cluster=\"$cluster\", namespace=~\"$namespace\"}[$__interval])\n* on (namespace,pod)\ngroup_left(workload,workload_type) namespace_workload_pod:kube_pod_owner:relabel{cluster=\"$cluster\", namespace=~\"$namespace\", workload=~\".+\", workload_type=\"$type\"}) by (workload))\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{workload}}",
@ -5757,7 +5823,7 @@ data:
"value": "deployment"
},
"datasource": "$datasource",
"definition": "label_values(mixin_pod_workload{namespace=~\"$namespace\", workload=~\".+\"}, workload_type)",
"definition": "label_values(namespace_workload_pod:kube_pod_owner:relabel{namespace=~\"$namespace\", workload=~\".+\"}, workload_type)",
"hide": 0,
"includeAll": false,
"label": null,
@ -5766,7 +5832,7 @@ data:
"options": [
],
"query": "label_values(mixin_pod_workload{namespace=~\"$namespace\", workload=~\".+\"}, workload_type)",
"query": "label_values(namespace_workload_pod:kube_pod_owner:relabel{namespace=~\"$namespace\", workload=~\".+\"}, workload_type)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
@ -5864,7 +5930,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Compute Resources / Namespace (Workloads)",
"uid": "a87fb0d919ec0ea5f6543124e16c42a5",
"version": 0

View File

@ -20,6 +20,24 @@ data:
"id": null,
"links": [
],
"panels": [
{
"content": "The SLO (service level objective) and other metrics displayed on this dashboard are for informational purposes only.",
"datasource": null,
"description": "The SLO (service level objective) and other metrics displayed on this dashboard are for informational purposes only.",
"gridPos": {
"h": 2,
"w": 24,
"x": 0,
"y": 0
},
"id": 2,
"mode": "markdown",
"span": 12,
"title": "Notice",
"type": "text"
}
],
"refresh": "10s",
"rows": [
@ -37,7 +55,9 @@ data:
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"decimals": 3,
"description": "How many percent of requests (both read and write) in 30 days have been answered successfully and fast enough?",
"format": "percentunit",
"gauge": {
"maxValue": 100,
"minValue": 0,
@ -48,7 +68,7 @@ data:
"gridPos": {
},
"id": 2,
"id": 3,
"interval": null,
"links": [
@ -78,7 +98,7 @@ data:
"to": "null"
}
],
"span": 2,
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
@ -88,7 +108,7 @@ data:
"tableColumn": "",
"targets": [
{
"expr": "sum(up{job=\"apiserver\", cluster=\"$cluster\"})",
"expr": "apiserver_request:availability30d{verb=\"all\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
@ -96,7 +116,7 @@ data:
}
],
"thresholds": "",
"title": "Up",
"title": "Availability (30d) > 99.000%",
"tooltip": {
"shared": false
},
@ -109,7 +129,7 @@ data:
"value": "null"
}
],
"valueName": "min"
"valueName": "avg"
},
{
"aliasColors": {
@ -119,11 +139,13 @@ data:
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"fill": 1,
"decimals": 3,
"description": "How much error budget is left looking at our 0.990% availability gurantees?",
"fill": 10,
"gridPos": {
},
"id": 3,
"id": 4,
"legend": {
"alignAsTable": false,
"avg": false,
@ -132,6 +154,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -150,37 +173,16 @@ data:
],
"spaceLength": 10,
"span": 5,
"span": 8,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\", instance=~\"$instance\",code=~\"2..\", cluster=\"$cluster\"}[5m]))",
"expr": "100 * (apiserver_request:availability30d{verb=\"all\", cluster=\"$cluster\"} - 0.990000)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "2xx",
"legendFormat": "errorbudget",
"refId": "A"
},
{
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\", instance=~\"$instance\",code=~\"3..\", cluster=\"$cluster\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "3xx",
"refId": "B"
},
{
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\", instance=~\"$instance\",code=~\"4..\", cluster=\"$cluster\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "4xx",
"refId": "C"
},
{
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\", instance=~\"$instance\",code=~\"5..\", cluster=\"$cluster\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "5xx",
"refId": "D"
}
],
"thresholds": [
@ -188,7 +190,7 @@ data:
],
"timeFrom": null,
"timeShift": null,
"title": "RPC Rate",
"title": "ErrorBudget (30d) > 99.000%",
"tooltip": {
"shared": false,
"sort": 0,
@ -206,7 +208,8 @@ data:
},
"yaxes": [
{
"format": "ops",
"decimals": 3,
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
@ -214,7 +217,215 @@ data:
"show": true
},
{
"format": "ops",
"decimals": 3,
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
},
{
"collapse": false,
"collapsed": false,
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"decimals": 3,
"description": "How many percent of read requests (LIST,GET) in 30 days have been answered successfully and fast enough?",
"format": "percentunit",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 5,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "apiserver_request:availability30d{verb=\"read\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Read Availability (30d)",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many read requests (LIST,GET) per second do the apiservers get by code?",
"fill": 10,
"gridPos": {
},
"id": 6,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/2../i",
"color": "#56A64B"
},
{
"alias": "/3../i",
"color": "#F2CC0C"
},
{
"alias": "/4../i",
"color": "#3274D9"
},
{
"alias": "/5../i",
"color": "#E02F44"
}
],
"spaceLength": 10,
"span": 3,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum by (code) (code_resource:apiserver_request_total:rate5m{verb=\"read\", cluster=\"$cluster\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ code }}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Read SLI - Requests",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "reqps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "reqps",
"label": null,
"logBase": 1,
"max": null,
@ -231,21 +442,23 @@ data:
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many percent of read requests (LIST,GET) per second are returned with errors (5xx)?",
"fill": 1,
"gridPos": {
},
"id": 4,
"id": 7,
"legend": {
"alignAsTable": true,
"alignAsTable": false,
"avg": false,
"current": true,
"current": false,
"max": false,
"min": false,
"rightSide": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": true
"values": false
},
"lines": true,
"linewidth": 1,
@ -262,15 +475,15 @@ data:
],
"spaceLength": 10,
"span": 5,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\", instance=~\"$instance\", verb!=\"WATCH\", cluster=\"$cluster\"}[5m])) by (verb, le))",
"expr": "sum by (resource) (code_resource:apiserver_request_total:rate5m{verb=\"read\",code=~\"5..\", cluster=\"$cluster\"}) / sum by (resource) (code_resource:apiserver_request_total:rate5m{verb=\"read\", cluster=\"$cluster\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}}",
"legendFormat": "{{ resource }}",
"refId": "A"
}
],
@ -279,7 +492,493 @@ data:
],
"timeFrom": null,
"timeShift": null,
"title": "Request duration 99th quantile",
"title": "Read SLI - Errors",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
}
]
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many seconds is the 99th percentile for reading (LIST|GET) a given resource?",
"fill": 1,
"gridPos": {
},
"id": 8,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{verb=\"read\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ resource }}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Read SLI - Duration",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
},
{
"collapse": false,
"collapsed": false,
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"decimals": 3,
"description": "How many percent of write requests (POST|PUT|PATCH|DELETE) in 30 days have been answered successfully and fast enough?",
"format": "percentunit",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 9,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "apiserver_request:availability30d{verb=\"write\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Write Availability (30d)",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many write requests (POST|PUT|PATCH|DELETE) per second do the apiservers get by code?",
"fill": 10,
"gridPos": {
},
"id": 10,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/2../i",
"color": "#56A64B"
},
{
"alias": "/3../i",
"color": "#F2CC0C"
},
{
"alias": "/4../i",
"color": "#3274D9"
},
{
"alias": "/5../i",
"color": "#E02F44"
}
],
"spaceLength": 10,
"span": 3,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum by (code) (code_resource:apiserver_request_total:rate5m{verb=\"write\", cluster=\"$cluster\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ code }}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Write SLI - Requests",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "reqps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "reqps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many percent of write requests (POST|PUT|PATCH|DELETE) per second are returned with errors (5xx)?",
"fill": 1,
"gridPos": {
},
"id": 11,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum by (resource) (code_resource:apiserver_request_total:rate5m{verb=\"write\",code=~\"5..\", cluster=\"$cluster\"}) / sum by (resource) (code_resource:apiserver_request_total:rate5m{verb=\"write\", cluster=\"$cluster\"})",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ resource }}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Write SLI - Errors",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "percentunit",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
}
]
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"description": "How many seconds is the 99th percentile for writing (POST|PUT|PATCH|DELETE) a given resource?",
"fill": 1,
"gridPos": {
},
"id": 12,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 3,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{verb=\"write\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{ resource }}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Write SLI - Duration",
"tooltip": {
"shared": false,
"sort": 0,
@ -339,7 +1038,7 @@ data:
"gridPos": {
},
"id": 5,
"id": 13,
"legend": {
"alignAsTable": false,
"avg": false,
@ -348,6 +1047,7 @@ data:
"min": false,
"rightSide": false,
"show": false,
"sideWidth": null,
"total": false,
"values": false
},
@ -430,7 +1130,7 @@ data:
"gridPos": {
},
"id": 6,
"id": 14,
"legend": {
"alignAsTable": false,
"avg": false,
@ -439,6 +1139,7 @@ data:
"min": false,
"rightSide": false,
"show": false,
"sideWidth": null,
"total": false,
"values": false
},
@ -521,7 +1222,7 @@ data:
"gridPos": {
},
"id": 7,
"id": 15,
"legend": {
"alignAsTable": true,
"avg": false,
@ -530,6 +1231,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -625,307 +1327,7 @@ data:
"gridPos": {
},
"id": 8,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "etcd_helper_cache_entry_total{job=\"apiserver\", instance=~\"$instance\", cluster=\"$cluster\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "ETCD Cache Entry Total",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
}
]
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"fill": 1,
"gridPos": {
},
"id": 9,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(etcd_helper_cache_hit_total{job=\"apiserver\",instance=~\"$instance\", cluster=\"$cluster\"}[5m])) by (instance)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} hit",
"refId": "A"
},
{
"expr": "sum(rate(etcd_helper_cache_miss_total{job=\"apiserver\",instance=~\"$instance\", cluster=\"$cluster\"}[5m])) by (instance)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} miss",
"refId": "B"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "ETCD Cache Hit/Miss Rate",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "ops",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "ops",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
}
]
},
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"fill": 1,
"gridPos": {
},
"id": 10,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"span": 4,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99,sum(rate(etcd_request_cache_get_duration_seconds_bucket{job=\"apiserver\",instance=~\"$instance\", cluster=\"$cluster\"}[5m])) by (instance, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} get",
"refId": "A"
},
{
"expr": "histogram_quantile(0.99,sum(rate(etcd_request_cache_add_duration_seconds_bucket{job=\"apiserver\",instance=~\"$instance\", cluster=\"$cluster\"}[5m])) by (instance, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{instance}} miss",
"refId": "B"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "ETCD Cache Duration 99th Quantile",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
},
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": 0,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
},
{
"collapse": false,
"collapsed": false,
"panels": [
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"fill": 1,
"gridPos": {
},
"id": 11,
"id": 16,
"legend": {
"alignAsTable": false,
"avg": false,
@ -934,6 +1336,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1016,7 +1419,7 @@ data:
"gridPos": {
},
"id": 12,
"id": 17,
"legend": {
"alignAsTable": false,
"avg": false,
@ -1025,6 +1428,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1107,7 +1511,7 @@ data:
"gridPos": {
},
"id": 13,
"id": 18,
"legend": {
"alignAsTable": false,
"avg": false,
@ -1116,6 +1520,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1222,20 +1627,19 @@ data:
{
"allValue": null,
"current": {
"text": "prod",
"value": "prod"
},
"datasource": "$datasource",
"hide": 2,
"includeAll": false,
"label": null,
"label": "cluster",
"multi": false,
"name": "cluster",
"options": [
],
"query": "label_values(apiserver_request_total, cluster)",
"refresh": 1,
"refresh": 2,
"regex": "",
"sort": 1,
"tagValuesQuery": "",
@ -1303,7 +1707,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / API server",
"uid": "09ec8aa1e996d6ffcd6817bbaff4db1b",
"version": 0
@ -1440,6 +1844,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1544,6 +1949,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1648,6 +2054,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1752,6 +2159,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1864,6 +2272,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1887,7 +2296,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-controller-manager\", instance=~\"$instance\", verb=\"POST\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-controller-manager\", instance=~\"$instance\", verb=\"POST\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -1968,6 +2377,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -1991,7 +2401,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-controller-manager\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-controller-manager\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -2072,6 +2482,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2163,6 +2574,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2254,6 +2666,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -2414,7 +2827,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Controller Manager",
"uid": "72e0e05bef5099e5f049b05fdc429ed4",
"version": 0
@ -2467,6 +2880,7 @@ data:
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -2662,6 +3076,7 @@ data:
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -2965,7 +3380,7 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Persistent Volumes",
"uid": "919b92a8e8041bd567af9edab12c840c",
"version": 0
@ -3102,6 +3517,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -3214,6 +3630,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -3339,6 +3756,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3451,6 +3869,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3474,7 +3893,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-scheduler\", instance=~\"$instance\", verb=\"POST\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-scheduler\", instance=~\"$instance\", verb=\"POST\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -3555,6 +3974,7 @@ data:
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"total": false,
"values": true
},
@ -3578,7 +3998,7 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_latency_seconds_bucket{job=\"kube-scheduler\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"expr": "histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job=\"kube-scheduler\", instance=~\"$instance\", verb=\"GET\"}[5m])) by (verb, url, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{verb}} {{url}}",
@ -3659,6 +4079,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3750,6 +4171,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -3841,6 +4263,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -4001,11 +4424,916 @@ data:
"30d"
]
},
"timezone": "",
"timezone": "UTC",
"title": "Kubernetes / Scheduler",
"uid": "2e6b6a3b4bddf1427b3a55aa1311c656",
"version": 0
}
statefulset.json: |-
{
"__inputs": [
],
"__requires": [
],
"annotations": {
"list": [
]
},
"editable": false,
"gnetId": null,
"graphTooltip": 0,
"hideControls": false,
"id": null,
"links": [
],
"refresh": "",
"rows": [
{
"collapse": false,
"collapsed": false,
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 2,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "cores",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"tableColumn": "",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=~\"$statefulset.*\"}[3m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "CPU",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 3,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "GB",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"tableColumn": "",
"targets": [
{
"expr": "sum(container_memory_usage_bytes{job=\"kubernetes-cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=~\"$statefulset.*\"}) / 1024^3",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Memory",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 4,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "Bps",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 4,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"tableColumn": "",
"targets": [
{
"expr": "sum(rate(container_network_transmit_bytes_total{job=\"kubernetes-cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=~\"$statefulset.*\"}[3m])) + sum(rate(container_network_receive_bytes_total{cluster=\"$cluster\", namespace=\"$namespace\",pod=~\"$statefulset.*\"}[3m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Network",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
},
{
"collapse": false,
"collapsed": false,
"height": "100px",
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 5,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "max(kube_statefulset_replicas{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", statefulset=\"$statefulset\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Desired Replicas",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 6,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "min(kube_statefulset_status_replicas_current{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", statefulset=\"$statefulset\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Replicas of current version",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 7,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "max(kube_statefulset_status_observed_generation{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", statefulset=\"$statefulset\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Observed Generation",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "$datasource",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
},
"id": 8,
"interval": null,
"links": [
],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 3,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "max(kube_statefulset_metadata_generation{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "",
"refId": "A"
}
],
"thresholds": "",
"title": "Metadata Generation",
"tooltip": {
"shared": false
},
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "0",
"value": "null"
}
],
"valueName": "current"
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
},
{
"collapse": false,
"collapsed": false,
"panels": [
{
"aliasColors": {
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "$datasource",
"fill": 1,
"gridPos": {
},
"id": 9,
"legend": {
"alignAsTable": false,
"avg": false,
"current": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [
],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "max(kube_statefulset_replicas{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "replicas specified",
"refId": "A"
},
{
"expr": "max(kube_statefulset_status_replicas{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "replicas created",
"refId": "B"
},
{
"expr": "min(kube_statefulset_status_replicas_ready{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "ready",
"refId": "C"
},
{
"expr": "min(kube_statefulset_status_replicas_current{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "replicas of current version",
"refId": "D"
},
{
"expr": "min(kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\", statefulset=\"$statefulset\", cluster=\"$cluster\", namespace=\"$namespace\"}) without (instance, pod)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "updated",
"refId": "E"
}
],
"thresholds": [
],
"timeFrom": null,
"timeShift": null,
"title": "Replicas",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": [
]
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6",
"type": "row"
}
],
"schemaVersion": 14,
"style": "dark",
"tags": [
"kubernetes-mixin"
],
"templating": {
"list": [
{
"current": {
"text": "default",
"value": "default"
},
"hide": 0,
"label": null,
"name": "datasource",
"options": [
],
"query": "prometheus",
"refresh": 1,
"regex": "",
"type": "datasource"
},
{
"allValue": null,
"current": {
},
"datasource": "$datasource",
"hide": 2,
"includeAll": false,
"label": "cluster",
"multi": false,
"name": "cluster",
"options": [
],
"query": "label_values(kube_statefulset_metadata_generation, cluster)",
"refresh": 2,
"regex": "",
"sort": 1,
"tagValuesQuery": "",
"tags": [
],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
},
"datasource": "$datasource",
"hide": 0,
"includeAll": false,
"label": "Namespace",
"multi": false,
"name": "namespace",
"options": [
],
"query": "label_values(kube_statefulset_metadata_generation{job=\"kube-state-metrics\", cluster=\"$cluster\"}, namespace)",
"refresh": 2,
"regex": "",
"sort": 1,
"tagValuesQuery": "",
"tags": [
],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
},
"datasource": "$datasource",
"hide": 0,
"includeAll": false,
"label": "Name",
"multi": false,
"name": "statefulset",
"options": [
],
"query": "label_values(kube_statefulset_metadata_generation{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\"}, statefulset)",
"refresh": 2,
"regex": "",
"sort": 1,
"tagValuesQuery": "",
"tags": [
],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "UTC",
"title": "Kubernetes / StatefulSets",
"uid": "a31c1f46e6f727cb37c0d731a7245005",
"version": 0
}
kind: ConfigMap
metadata:
name: grafana-dashboards-k8s

View File

@ -308,6 +308,7 @@ data:
"min": false,
"rightSide": "true",
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -399,6 +400,7 @@ data:
"min": false,
"rightSide": "true",
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -503,6 +505,7 @@ data:
"min": false,
"rightSide": "true",
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -621,6 +624,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -719,6 +723,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},
@ -810,6 +815,7 @@ data:
"min": false,
"rightSide": false,
"show": "true",
"sideWidth": null,
"total": false,
"values": "true"
},

View File

@ -48,6 +48,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -140,6 +141,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -265,6 +267,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -471,6 +474,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -586,6 +590,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -704,6 +709,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -796,6 +802,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},

View File

@ -48,6 +48,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -71,10 +72,10 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(\n prometheus_remote_storage_highest_timestamp_in_seconds{cluster=~\"$cluster\", instance=~\"$instance\"} \n- \n ignoring(queue) group_right(instance) prometheus_remote_storage_queue_highest_sent_timestamp_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}\n)\n",
"expr": "(\n prometheus_remote_storage_highest_timestamp_in_seconds{cluster=~\"$cluster\", instance=~\"$instance\"} \n- \n ignoring(remote_name, url) group_right(instance) prometheus_remote_storage_queue_highest_sent_timestamp_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}\n)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -139,6 +140,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -162,10 +164,10 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "(\n rate(prometheus_remote_storage_highest_timestamp_in_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}[5m]) \n- \n ignoring (queue) group_right(instance) rate(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n)\n",
"expr": "(\n rate(prometheus_remote_storage_highest_timestamp_in_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}[5m]) \n- \n ignoring (remote_name, url) group_right(instance) rate(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n)\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -243,6 +245,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -266,10 +269,10 @@ data:
"steppedLine": false,
"targets": [
{
"expr": "rate(\n prometheus_remote_storage_samples_in_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n- \n ignoring(queue) group_right(instance) rate(prometheus_remote_storage_succeeded_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m]) \n- \n rate(prometheus_remote_storage_dropped_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n",
"expr": "rate(\n prometheus_remote_storage_samples_in_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n- \n ignoring(remote_name, url) group_right(instance) rate(prometheus_remote_storage_succeeded_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n- \n rate(prometheus_remote_storage_dropped_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])\n",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -347,6 +350,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -374,7 +378,7 @@ data:
"expr": "prometheus_remote_storage_shards{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -439,6 +443,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -465,7 +470,7 @@ data:
"expr": "prometheus_remote_storage_shards_max{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -530,6 +535,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -556,7 +562,7 @@ data:
"expr": "prometheus_remote_storage_shards_min{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -621,6 +627,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -647,7 +654,7 @@ data:
"expr": "prometheus_remote_storage_shards_desired{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -725,6 +732,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -751,7 +759,7 @@ data:
"expr": "prometheus_remote_storage_shard_capacity{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -816,6 +824,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -842,7 +851,7 @@ data:
"expr": "prometheus_remote_storage_pending_samples{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -920,6 +929,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1011,6 +1021,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1037,7 +1048,7 @@ data:
"expr": "prometheus_wal_watcher_current_segment{cluster=~\"$cluster\", instance=~\"$instance\"}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{consumer}}",
"refId": "A"
}
],
@ -1115,6 +1126,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1141,7 +1153,7 @@ data:
"expr": "rate(prometheus_remote_storage_dropped_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -1206,6 +1218,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1232,7 +1245,7 @@ data:
"expr": "rate(prometheus_remote_storage_failed_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -1297,6 +1310,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1323,7 +1337,7 @@ data:
"expr": "rate(prometheus_remote_storage_retried_samples_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -1388,6 +1402,7 @@ data:
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"total": false,
"values": false
},
@ -1414,7 +1429,7 @@ data:
"expr": "rate(prometheus_remote_storage_enqueue_retries_total{cluster=~\"$cluster\", instance=~\"$instance\"}[5m])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{cluster}}:{{instance}}-{{queue}}",
"legendFormat": "{{cluster}}:{{instance}} {{remote_name}}:{{url}}",
"refId": "A"
}
],
@ -1567,11 +1582,11 @@ data:
"includeAll": true,
"label": null,
"multi": false,
"name": "queue",
"name": "url",
"options": [
],
"query": "label_values(prometheus_remote_storage_shards{cluster=~\"$cluster\", instance=~\"$instance\"}, queue)",
"query": "label_values(prometheus_remote_storage_shards{cluster=~\"$cluster\", instance=~\"$instance\"}, url)",
"refresh": 2,
"regex": "",
"sort": 0,
@ -1690,6 +1705,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #A",
@ -1708,6 +1724,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "Value #B",
@ -1726,6 +1743,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "instance",
@ -1744,6 +1762,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "job",
@ -1762,6 +1781,7 @@ data:
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"linkTargetBlank": false,
"linkTooltip": "Drill down",
"linkUrl": "",
"pattern": "version",
@ -2814,7 +2834,7 @@ data:
]
},
"timezone": "utc",
"title": "Prometheus",
"title": "Prometheus Overview",
"uid": "",
"version": 0
}

View File

@ -18,12 +18,13 @@ spec:
labels:
name: grafana
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: grafana
image: docker.io/grafana/grafana:7.0.0
image: docker.io/grafana/grafana:7.1.5
env:
- name: GF_PATHS_CONFIG
value: "/etc/grafana/custom.ini"

View File

@ -1,4 +1,4 @@
apiVersion: networking.k8s.io/v1beta1
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: public

View File

@ -17,12 +17,13 @@ spec:
labels:
name: nginx-ingress-controller
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
image: k8s.gcr.io/ingress-nginx/controller:v0.35.0
args:
- /nginx-ingress-controller
- --ingress-class=public
@ -47,7 +48,6 @@ spec:
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
@ -55,15 +55,16 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
lifecycle:
preStop:

View File

@ -1,4 +1,4 @@
apiVersion: networking.k8s.io/v1beta1
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: public

View File

@ -17,12 +17,13 @@ spec:
labels:
name: nginx-ingress-controller
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
image: k8s.gcr.io/ingress-nginx/controller:v0.35.0
args:
- /nginx-ingress-controller
- --ingress-class=public
@ -47,7 +48,6 @@ spec:
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
@ -55,15 +55,16 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
lifecycle:
preStop:

View File

@ -1,4 +1,4 @@
apiVersion: networking.k8s.io/v1beta1
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: public

View File

@ -1,7 +1,7 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: ingress-controller-public
name: nginx-ingress-controller
namespace: ingress
spec:
replicas: 2
@ -10,19 +10,20 @@ spec:
maxUnavailable: 1
selector:
matchLabels:
name: ingress-controller-public
name: nginx-ingress-controller
phase: prod
template:
metadata:
labels:
name: ingress-controller-public
name: nginx-ingress-controller
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
image: k8s.gcr.io/ingress-nginx/controller:v0.35.0
args:
- /nginx-ingress-controller
- --ingress-class=public
@ -76,4 +77,3 @@ spec:
runAsUser: 101 # www-data
restartPolicy: Always
terminationGracePeriodSeconds: 300

View File

@ -1,4 +1,4 @@
apiVersion: networking.k8s.io/v1beta1
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: public

View File

@ -17,12 +17,13 @@ spec:
labels:
name: nginx-ingress-controller
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
image: k8s.gcr.io/ingress-nginx/controller:v0.35.0
args:
- /nginx-ingress-controller
- --ingress-class=public
@ -47,7 +48,6 @@ spec:
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
@ -55,15 +55,16 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
lifecycle:
preStop:

View File

@ -1,4 +1,4 @@
apiVersion: networking.k8s.io/v1beta1
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: public

View File

@ -17,12 +17,13 @@ spec:
labels:
name: nginx-ingress-controller
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.32.0
image: k8s.gcr.io/ingress-nginx/controller:v0.35.0
args:
- /nginx-ingress-controller
- --ingress-class=public
@ -47,7 +48,6 @@ spec:
containerPort: 10254
hostPort: 10254
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
@ -55,15 +55,16 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 5
lifecycle:
preStop:

View File

@ -34,7 +34,7 @@ data:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
@ -68,13 +68,16 @@ data:
- source_labels: [__name__, group]
regex: apiserver_request_duration_seconds_bucket;.+
action: drop
- source_labels: [__name__, group]
regex: apiserver_request_duration_seconds_count;.+
action: drop
# Scrape config for node (i.e. kubelet) /metrics (e.g. 'kubelet_'). Explore
# metrics from a node by scraping kubelet (127.0.0.1:10250/metrics).
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
@ -82,10 +85,6 @@ data:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_name
# Scrape config for Kubelet cAdvisor. Explore metrics from a node by
# scraping kubelet (127.0.0.1:10250/metrics/cadvisor).
- job_name: 'kubernetes-cadvisor'
@ -100,9 +99,6 @@ data:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_name
metric_relabel_configs:
- source_labels: [__name__, image]
action: drop
@ -121,13 +117,11 @@ data:
- source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_controller]
action: keep
regex: 'true'
- action: labelmap
regex: __meta_kubernetes_node_name
- source_labels: [__meta_kubernetes_node_address_InternalIP]
action: replace
target_label: __address__
replacement: '${1}:2381'
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
@ -172,7 +166,7 @@ data:
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: job
metric_relabel_configs:
- source_labels: [__name__]
action: drop

View File

@ -14,13 +14,14 @@ spec:
labels:
name: prometheus
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
serviceAccountName: prometheus
containers:
- name: prometheus
image: quay.io/prometheus/prometheus:v2.18.1
image: quay.io/prometheus/prometheus:v2.21.0
args:
- --web.listen-address=0.0.0.0:9090
- --config.file=/etc/prometheus/prometheus.yaml

View File

@ -18,13 +18,14 @@ spec:
labels:
name: kube-state-metrics
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: quay.io/coreos/kube-state-metrics:v1.9.6
image: quay.io/coreos/kube-state-metrics:v1.9.7
ports:
- name: metrics
containerPort: 8080

View File

@ -17,18 +17,18 @@ spec:
labels:
name: node-exporter
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
serviceAccountName: node-exporter
securityContext:
runAsNonRoot: true
runAsUser: 65534
seccompProfile:
type: RuntimeDefault
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v1.0.0-rc.1
image: quay.io/prometheus/node-exporter:v1.0.1
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys

View File

@ -11,8 +11,8 @@ data:
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": members are down ({{ $value }})."
},
"expr": "max by (job) (\n sum by (job) (up{job=~\".*etcd.*\"} == bool 0)\nor\n count by (job,endpoint) (\n sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{job=~\".*etcd.*\"}[3m])) > 0.01\n )\n)\n> 0\n",
"for": "3m",
"expr": "max without (endpoint) (\n sum without (instance) (up{job=~\".*etcd.*\"} == bool 0)\nor\n count without (To) (\n sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~\".*etcd.*\"}[120s])) > 0.01\n )\n)\n> 0\n",
"for": "10m",
"labels": {
"severity": "critical"
}
@ -22,7 +22,7 @@ data:
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": insufficient members ({{ $value }})."
},
"expr": "sum(up{job=~\".*etcd.*\"} == bool 1) by (job) < ((count(up{job=~\".*etcd.*\"}) by (job) + 1) / 2)\n",
"expr": "sum(up{job=~\".*etcd.*\"} == bool 1) without (instance) < ((count(up{job=~\".*etcd.*\"}) without (instance) + 1) / 2)\n",
"for": "3m",
"labels": {
"severity": "critical"
@ -44,18 +44,40 @@ data:
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated."
},
"expr": "increase((max by (job) (etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}))[15m:1m]) >= 3\n",
"expr": "increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}))[15m:1m]) >= 4\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighNumberOfFailedGRPCRequests",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}."
},
"expr": "100 * sum(rate(grpc_server_handled_total{job=~\".*etcd.*\", grpc_code!=\"OK\"}[5m])) without (grpc_type, grpc_code)\n /\nsum(rate(grpc_server_handled_total{job=~\".*etcd.*\"}[5m])) without (grpc_type, grpc_code)\n > 1\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighNumberOfFailedGRPCRequests",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}."
},
"expr": "100 * sum(rate(grpc_server_handled_total{job=~\".*etcd.*\", grpc_code!=\"OK\"}[5m])) without (grpc_type, grpc_code)\n /\nsum(rate(grpc_server_handled_total{job=~\".*etcd.*\"}[5m])) without (grpc_type, grpc_code)\n > 5\n",
"for": "5m",
"labels": {
"severity": "critical"
}
},
{
"alert": "etcdGRPCRequestsSlow",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}."
},
"expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\", grpc_type=\"unary\"}[5m])) by (job, instance, grpc_service, grpc_method, le))\n> 0.15\n",
"expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\", grpc_type=\"unary\"}[5m])) without(grpc_type))\n> 0.15\n",
"for": "10m",
"labels": {
"severity": "critical"
@ -110,7 +132,7 @@ data:
"annotations": {
"message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}"
},
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.01\n",
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) without (code) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nwithout (code) > 0.01\n",
"for": "10m",
"labels": {
"severity": "warning"
@ -121,7 +143,7 @@ data:
"annotations": {
"message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}."
},
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.05\n",
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) without (code) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nwithout (code) > 0.05\n",
"for": "10m",
"labels": {
"severity": "critical"
@ -145,112 +167,137 @@ data:
kube.yaml: |-
{
"groups": [
{
"name": "kube-apiserver-error",
"rules": [
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[5m]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate5m"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[30m]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate30m"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[1h]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate1h"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[2h]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate2h"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[6h]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate6h"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[1d]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate1d"
},
{
"expr": "sum by (status_class) (\n label_replace(\n rate(apiserver_request_total{job=\"apiserver\"}[3d]\n ), \"status_class\", \"${1}xx\", \"code\", \"([0-9])..\")\n)\n",
"labels": {
"job": "apiserver"
},
"record": "status_class:apiserver_request_total:rate3d"
},
{
"expr": "sum(status_class:apiserver_request_total:rate5m{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate5m{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate5m"
},
{
"expr": "sum(status_class:apiserver_request_total:rate30m{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate30m{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate30m"
},
{
"expr": "sum(status_class:apiserver_request_total:rate1h{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate1h{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate1h"
},
{
"expr": "sum(status_class:apiserver_request_total:rate2h{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate2h{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate2h"
},
{
"expr": "sum(status_class:apiserver_request_total:rate6h{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate6h{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate6h"
},
{
"expr": "sum(status_class:apiserver_request_total:rate1d{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate1d{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate1d"
},
{
"expr": "sum(status_class:apiserver_request_total:rate3d{job=\"apiserver\",status_class=\"5xx\"})\n/\nsum(status_class:apiserver_request_total:rate3d{job=\"apiserver\"})\n",
"labels": {
"job": "apiserver"
},
"record": "status_class_5xx:apiserver_request_total:ratio_rate3d"
}
]
},
{
"name": "kube-apiserver.rules",
"rules": [
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[1d]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[1d]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[1d]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[1d]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[1d]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[1d]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate1d"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[1h]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[1h]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[1h]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[1h]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[1h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[1h]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate1h"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[2h]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[2h]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[2h]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[2h]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[2h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[2h]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate2h"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[30m]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[30m]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[30m]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[30m]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[30m]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[30m]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate30m"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[3d]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[3d]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[3d]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[3d]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[3d]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[3d]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate3d"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[5m]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[5m]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[5m]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[5m]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[5m]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[5m]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate5m"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[6h]))\n -\n (\n (\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[6h]))\n or\n vector(0)\n )\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[6h]))\n +\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[6h]))\n )\n )\n +\n # errors\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\",code=~\"5..\"}[6h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[6h]))\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:burnrate6h"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[1d]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[1d]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[1d]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[1d]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate1d"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[1h]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[1h]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[1h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[1h]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate1h"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[2h]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[2h]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[2h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[2h]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate2h"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[30m]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[30m]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[30m]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[30m]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate30m"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[3d]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[3d]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[3d]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[3d]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate3d"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[5m]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[5m]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[5m]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[5m]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate5m"
},
{
"expr": "(\n (\n # too slow\n sum(rate(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[6h]))\n -\n sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[6h]))\n )\n +\n sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\",code=~\"5..\"}[6h]))\n)\n/\nsum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[6h]))\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:burnrate6h"
},
{
"expr": "sum by (code,resource) (rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[5m]))\n",
"labels": {
"verb": "read"
},
"record": "code_resource:apiserver_request_total:rate5m"
},
{
"expr": "sum by (code,resource) (rate(apiserver_request_total{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[5m]))\n",
"labels": {
"verb": "write"
},
"record": "code_resource:apiserver_request_total:rate5m"
},
{
"expr": "histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\"}[5m]))) > 0\n",
"labels": {
"quantile": "0.99",
"verb": "read"
},
"record": "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile"
},
{
"expr": "histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"POST|PUT|PATCH|DELETE\"}[5m]))) > 0\n",
"labels": {
"quantile": "0.99",
"verb": "write"
},
"record": "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile"
},
{
"expr": "sum(rate(apiserver_request_duration_seconds_sum{subresource!=\"log\",verb!~\"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT\"}[5m])) without(instance, pod)\n/\nsum(rate(apiserver_request_duration_seconds_count{subresource!=\"log\",verb!~\"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT\"}[5m])) without(instance, pod)\n",
"record": "cluster:apiserver_request_duration_seconds:mean5m"
@ -278,6 +325,143 @@ data:
}
]
},
{
"interval": "3m",
"name": "kube-apiserver-availability.rules",
"rules": [
{
"expr": "1 - (\n (\n # write too slow\n sum(increase(apiserver_request_duration_seconds_count{verb=~\"POST|PUT|PATCH|DELETE\"}[30d]))\n -\n sum(increase(apiserver_request_duration_seconds_bucket{verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[30d]))\n ) +\n (\n # read too slow\n sum(increase(apiserver_request_duration_seconds_count{verb=~\"LIST|GET\"}[30d]))\n -\n (\n (\n sum(increase(apiserver_request_duration_seconds_bucket{verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[30d]))\n or\n vector(0)\n )\n +\n sum(increase(apiserver_request_duration_seconds_bucket{verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[30d]))\n +\n sum(increase(apiserver_request_duration_seconds_bucket{verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[30d]))\n )\n ) +\n # errors\n sum(code:apiserver_request_total:increase30d{code=~\"5..\"} or vector(0))\n)\n/\nsum(code:apiserver_request_total:increase30d)\n",
"labels": {
"verb": "all"
},
"record": "apiserver_request:availability30d"
},
{
"expr": "1 - (\n sum(increase(apiserver_request_duration_seconds_count{job=\"apiserver\",verb=~\"LIST|GET\"}[30d]))\n -\n (\n # too slow\n (\n sum(increase(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=~\"resource|\",le=\"0.1\"}[30d]))\n or\n vector(0)\n )\n +\n sum(increase(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"namespace\",le=\"0.5\"}[30d]))\n +\n sum(increase(apiserver_request_duration_seconds_bucket{job=\"apiserver\",verb=~\"LIST|GET\",scope=\"cluster\",le=\"5\"}[30d]))\n )\n +\n # errors\n sum(code:apiserver_request_total:increase30d{verb=\"read\",code=~\"5..\"} or vector(0))\n)\n/\nsum(code:apiserver_request_total:increase30d{verb=\"read\"})\n",
"labels": {
"verb": "read"
},
"record": "apiserver_request:availability30d"
},
{
"expr": "1 - (\n (\n # too slow\n sum(increase(apiserver_request_duration_seconds_count{verb=~\"POST|PUT|PATCH|DELETE\"}[30d]))\n -\n sum(increase(apiserver_request_duration_seconds_bucket{verb=~\"POST|PUT|PATCH|DELETE\",le=\"1\"}[30d]))\n )\n +\n # errors\n sum(code:apiserver_request_total:increase30d{verb=\"write\",code=~\"5..\"} or vector(0))\n)\n/\nsum(code:apiserver_request_total:increase30d{verb=\"write\"})\n",
"labels": {
"verb": "write"
},
"record": "apiserver_request:availability30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"LIST\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"GET\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"POST\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PUT\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PATCH\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"DELETE\",code=~\"2..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"LIST\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"GET\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"POST\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PUT\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PATCH\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"DELETE\",code=~\"3..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"LIST\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"GET\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"POST\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PUT\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PATCH\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"DELETE\",code=~\"4..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"LIST\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"GET\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"POST\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PUT\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"PATCH\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code, verb) (increase(apiserver_request_total{job=\"apiserver\",verb=\"DELETE\",code=~\"5..\"}[30d]))\n",
"record": "code_verb:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~\"LIST|GET\"})\n",
"labels": {
"verb": "read"
},
"record": "code:apiserver_request_total:increase30d"
},
{
"expr": "sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~\"POST|PUT|PATCH|DELETE\"})\n",
"labels": {
"verb": "write"
},
"record": "code:apiserver_request_total:increase30d"
}
]
},
{
"name": "k8s.rules",
"rules": [
@ -286,23 +470,23 @@ data:
"record": "namespace:container_cpu_usage_seconds_total:sum_rate"
},
{
"expr": "sum by (cluster, namespace, pod, container) (\n rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container!=\"POD\"}[5m])\n) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (\n 1, max by(cluster, namespace, pod, node) (kube_pod_info)\n)\n",
"expr": "sum by (cluster, namespace, pod, container) (\n rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container!=\"POD\"}[5m])\n) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (\n 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\"})\n)\n",
"record": "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate"
},
{
"expr": "container_memory_working_set_bytes{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info)\n)\n",
"expr": "container_memory_working_set_bytes{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info{node!=\"\"})\n)\n",
"record": "node_namespace_pod_container:container_memory_working_set_bytes"
},
{
"expr": "container_memory_rss{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info)\n)\n",
"expr": "container_memory_rss{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info{node!=\"\"})\n)\n",
"record": "node_namespace_pod_container:container_memory_rss"
},
{
"expr": "container_memory_cache{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info)\n)\n",
"expr": "container_memory_cache{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info{node!=\"\"})\n)\n",
"record": "node_namespace_pod_container:container_memory_cache"
},
{
"expr": "container_memory_swap{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info)\n)\n",
"expr": "container_memory_swap{job=\"kubernetes-cadvisor\", image!=\"\"}\n* on (namespace, pod) group_left(node) topk by(namespace, pod) (1,\n max by(namespace, pod, node) (kube_pod_info{node!=\"\"})\n)\n",
"record": "node_namespace_pod_container:container_memory_swap"
},
{
@ -322,21 +506,21 @@ data:
"labels": {
"workload_type": "deployment"
},
"record": "mixin_pod_workload"
"record": "namespace_workload_pod:kube_pod_owner:relabel"
},
{
"expr": "max by (cluster, namespace, workload, pod) (\n label_replace(\n kube_pod_owner{job=\"kube-state-metrics\", owner_kind=\"DaemonSet\"},\n \"workload\", \"$1\", \"owner_name\", \"(.*)\"\n )\n)\n",
"labels": {
"workload_type": "daemonset"
},
"record": "mixin_pod_workload"
"record": "namespace_workload_pod:kube_pod_owner:relabel"
},
{
"expr": "max by (cluster, namespace, workload, pod) (\n label_replace(\n kube_pod_owner{job=\"kube-state-metrics\", owner_kind=\"StatefulSet\"},\n \"workload\", \"$1\", \"owner_name\", \"(.*)\"\n )\n)\n",
"labels": {
"workload_type": "statefulset"
},
"record": "mixin_pod_workload"
"record": "namespace_workload_pod:kube_pod_owner:relabel"
}
]
},
@ -412,11 +596,11 @@ data:
"name": "node.rules",
"rules": [
{
"expr": "sum(min(kube_pod_info) by (cluster, node))\n",
"expr": "sum(min(kube_pod_info{node!=\"\"}) by (cluster, node))\n",
"record": ":kube_pod_info_node_count:"
},
{
"expr": "topk by(namespace, pod) (1,\n max by (node, namespace, pod) (\n label_replace(kube_pod_info{job=\"kube-state-metrics\"}, \"pod\", \"$1\", \"pod\", \"(.*)\")\n))\n",
"expr": "topk by(namespace, pod) (1,\n max by (node, namespace, pod) (\n label_replace(kube_pod_info{job=\"kube-state-metrics\",node!=\"\"}, \"pod\", \"$1\", \"pod\", \"(.*)\")\n))\n",
"record": "node_namespace_pod:kube_pod_info:"
},
{
@ -461,104 +645,113 @@ data:
{
"alert": "KubePodCrashLooping",
"annotations": {
"message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} times / 5 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping"
"description": "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} times / 5 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping",
"summary": "Pod is crash looping."
},
"expr": "rate(kube_pod_container_status_restarts_total{job=\"kube-state-metrics\"}[15m]) * 60 * 5 > 0\n",
"expr": "rate(kube_pod_container_status_restarts_total{job=\"kube-state-metrics\"}[5m]) * 60 * 5 > 0\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubePodNotReady",
"annotations": {
"message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready"
"description": "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready",
"summary": "Pod has been in a non-ready state for more than 15 minutes."
},
"expr": "sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job=\"kube-state-metrics\", phase=~\"Pending|Unknown\"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!=\"Job\"})) > 0\n",
"expr": "sum by (namespace, pod) (\n max by(namespace, pod) (\n kube_pod_status_phase{job=\"kube-state-metrics\", phase=~\"Pending|Unknown\"}\n ) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (\n 1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!=\"Job\"})\n )\n) > 0\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeDeploymentGenerationMismatch",
"annotations": {
"message": "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch"
"description": "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch",
"summary": "Deployment generation mismatch due to possible roll-back"
},
"expr": "kube_deployment_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_deployment_metadata_generation{job=\"kube-state-metrics\"}\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeDeploymentReplicasMismatch",
"annotations": {
"message": "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch"
"description": "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch",
"summary": "Deployment has not matched the expected number of replicas."
},
"expr": "(\n kube_deployment_spec_replicas{job=\"kube-state-metrics\"}\n !=\n kube_deployment_status_replicas_available{job=\"kube-state-metrics\"}\n) and (\n changes(kube_deployment_status_replicas_updated{job=\"kube-state-metrics\"}[5m])\n ==\n 0\n)\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeStatefulSetReplicasMismatch",
"annotations": {
"message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch"
"description": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch",
"summary": "Deployment has not matched the expected number of replicas."
},
"expr": "(\n kube_statefulset_status_replicas_ready{job=\"kube-state-metrics\"}\n !=\n kube_statefulset_status_replicas{job=\"kube-state-metrics\"}\n) and (\n changes(kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}[5m])\n ==\n 0\n)\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeStatefulSetGenerationMismatch",
"annotations": {
"message": "StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch"
"description": "StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch",
"summary": "StatefulSet generation mismatch due to possible roll-back"
},
"expr": "kube_statefulset_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_statefulset_metadata_generation{job=\"kube-state-metrics\"}\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeStatefulSetUpdateNotRolledOut",
"annotations": {
"message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout"
"description": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout",
"summary": "StatefulSet update has not been rolled out."
},
"expr": "max without (revision) (\n kube_statefulset_status_current_revision{job=\"kube-state-metrics\"}\n unless\n kube_statefulset_status_update_revision{job=\"kube-state-metrics\"}\n)\n *\n(\n kube_statefulset_replicas{job=\"kube-state-metrics\"}\n !=\n kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}\n)\n",
"expr": "(\n max without (revision) (\n kube_statefulset_status_current_revision{job=\"kube-state-metrics\"}\n unless\n kube_statefulset_status_update_revision{job=\"kube-state-metrics\"}\n )\n *\n (\n kube_statefulset_replicas{job=\"kube-state-metrics\"}\n !=\n kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}\n )\n) and (\n changes(kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}[5m])\n ==\n 0\n)\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeDaemonSetRolloutStuck",
"annotations": {
"message": "Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck"
"description": "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck",
"summary": "DaemonSet rollout is stuck."
},
"expr": "kube_daemonset_status_number_ready{job=\"kube-state-metrics\"}\n /\nkube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"} < 1.00\n",
"expr": "(\n (\n kube_daemonset_status_current_number_scheduled{job=\"kube-state-metrics\"}\n !=\n kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n ) or (\n kube_daemonset_status_number_misscheduled{job=\"kube-state-metrics\"}\n !=\n 0\n ) or (\n kube_daemonset_updated_number_scheduled{job=\"kube-state-metrics\"}\n !=\n kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n ) or (\n kube_daemonset_status_number_available{job=\"kube-state-metrics\"}\n !=\n kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n )\n) and (\n changes(kube_daemonset_updated_number_scheduled{job=\"kube-state-metrics\"}[5m])\n ==\n 0\n)\n",
"for": "15m",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubeContainerWaiting",
"annotations": {
"message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting"
"description": "Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting",
"summary": "Pod container waiting longer than 1 hour"
},
"expr": "sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job=\"kube-state-metrics\"}) > 0\n",
"for": "1h",
@ -569,8 +762,9 @@ data:
{
"alert": "KubeDaemonSetNotScheduled",
"annotations": {
"message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled"
"description": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled",
"summary": "DaemonSet pods are not scheduled."
},
"expr": "kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n -\nkube_daemonset_status_current_number_scheduled{job=\"kube-state-metrics\"} > 0\n",
"for": "10m",
@ -581,23 +775,12 @@ data:
{
"alert": "KubeDaemonSetMisScheduled",
"annotations": {
"message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled"
"description": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled",
"summary": "DaemonSet pods are misscheduled."
},
"expr": "kube_daemonset_status_number_misscheduled{job=\"kube-state-metrics\"} > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeCronJobRunning",
"annotations": {
"message": "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning"
},
"expr": "time() - kube_cronjob_next_schedule_time{job=\"kube-state-metrics\"} > 3600\n",
"for": "1h",
"for": "15m",
"labels": {
"severity": "warning"
}
@ -605,11 +788,12 @@ data:
{
"alert": "KubeJobCompletion",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion"
"description": "Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion",
"summary": "Job did not complete in time"
},
"expr": "kube_job_spec_completions{job=\"kube-state-metrics\"} - kube_job_status_succeeded{job=\"kube-state-metrics\"} > 0\n",
"for": "1h",
"for": "12h",
"labels": {
"severity": "warning"
}
@ -617,8 +801,9 @@ data:
{
"alert": "KubeJobFailed",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed"
"description": "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed",
"summary": "Job failed to complete."
},
"expr": "kube_job_failed{job=\"kube-state-metrics\"} > 0\n",
"for": "15m",
@ -629,8 +814,9 @@ data:
{
"alert": "KubeHpaReplicasMismatch",
"annotations": {
"message": "HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch"
"description": "HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch",
"summary": "HPA has not matched descired number of replicas."
},
"expr": "(kube_hpa_status_desired_replicas{job=\"kube-state-metrics\"}\n !=\nkube_hpa_status_current_replicas{job=\"kube-state-metrics\"})\n and\nchanges(kube_hpa_status_current_replicas[15m]) == 0\n",
"for": "15m",
@ -641,8 +827,9 @@ data:
{
"alert": "KubeHpaMaxedOut",
"annotations": {
"message": "HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout"
"description": "HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout",
"summary": "HPA is running at max replicas"
},
"expr": "kube_hpa_status_current_replicas{job=\"kube-state-metrics\"}\n ==\nkube_hpa_spec_max_replicas{job=\"kube-state-metrics\"}\n",
"for": "15m",
@ -658,8 +845,9 @@ data:
{
"alert": "KubeCPUOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
"description": "Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit",
"summary": "Cluster has overcommitted CPU resource requests."
},
"expr": "sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{})\n /\nsum(kube_node_status_allocatable_cpu_cores)\n >\n(count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)\n",
"for": "5m",
@ -668,10 +856,11 @@ data:
}
},
{
"alert": "KubeMemOvercommit",
"alert": "KubeMemoryOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
"description": "Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit",
"summary": "Cluster has overcommitted memory resource requests."
},
"expr": "sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{})\n /\nsum(kube_node_status_allocatable_memory_bytes)\n >\n(count(kube_node_status_allocatable_memory_bytes)-1)\n /\ncount(kube_node_status_allocatable_memory_bytes)\n",
"for": "5m",
@ -680,10 +869,11 @@ data:
}
},
{
"alert": "KubeCPUOvercommit",
"alert": "KubeCPUQuotaOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
"description": "Cluster has overcommitted CPU resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit",
"summary": "Cluster has overcommitted CPU resource requests."
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"cpu\"})\n /\nsum(kube_node_status_allocatable_cpu_cores)\n > 1.5\n",
"for": "5m",
@ -692,10 +882,11 @@ data:
}
},
{
"alert": "KubeMemOvercommit",
"alert": "KubeMemoryQuotaOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
"description": "Cluster has overcommitted memory resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryquotaovercommit",
"summary": "Cluster has overcommitted memory resource requests."
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"memory\"})\n /\nsum(kube_node_status_allocatable_memory_bytes{job=\"node-exporter\"})\n > 1.5\n",
"for": "5m",
@ -703,13 +894,40 @@ data:
"severity": "warning"
}
},
{
"alert": "KubeQuotaAlmostFull",
"annotations": {
"description": "Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaalmostfull",
"summary": "Namespace quota is going to be full."
},
"expr": "kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n > 0.9 < 1\n",
"for": "15m",
"labels": {
"severity": "info"
}
},
{
"alert": "KubeQuotaFullyUsed",
"annotations": {
"description": "Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotafullyused",
"summary": "Namespace quota is fully used."
},
"expr": "kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n == 1\n",
"for": "15m",
"labels": {
"severity": "info"
}
},
{
"alert": "KubeQuotaExceeded",
"annotations": {
"message": "Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded"
"description": "Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded",
"summary": "Namespace quota has exceeded the limits."
},
"expr": "kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n > 0.90\n",
"expr": "kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n > 1\n",
"for": "15m",
"labels": {
"severity": "warning"
@ -718,13 +936,14 @@ data:
{
"alert": "CPUThrottlingHigh",
"annotations": {
"message": "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh"
"description": "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh",
"summary": "Processes experience elevated CPU throttling."
},
"expr": "sum(increase(container_cpu_cfs_throttled_periods_total{container!=\"\", }[5m])) by (container, pod, namespace)\n /\nsum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)\n > ( 100 / 100 )\n",
"expr": "sum(increase(container_cpu_cfs_throttled_periods_total{container!=\"\", }[5m])) by (container, pod, namespace)\n /\nsum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)\n > ( 80 / 100 )\n",
"for": "15m",
"labels": {
"severity": "warning"
"severity": "info"
}
}
]
@ -733,10 +952,11 @@ data:
"name": "kubernetes-storage",
"rules": [
{
"alert": "KubePersistentVolumeUsageCritical",
"alert": "KubePersistentVolumeFillingUp",
"annotations": {
"message": "The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeusagecritical"
"description": "The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup",
"summary": "PersistentVolume is filling up."
},
"expr": "kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\nkubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n < 0.03\n",
"for": "1m",
@ -745,22 +965,24 @@ data:
}
},
{
"alert": "KubePersistentVolumeFullInFourDays",
"alert": "KubePersistentVolumeFillingUp",
"annotations": {
"message": "Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefullinfourdays"
"description": "Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup",
"summary": "PersistentVolume is filling up."
},
"expr": "(\n kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\n kubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n) < 0.15\nand\npredict_linear(kubelet_volume_stats_available_bytes{job=\"kubelet\"}[6h], 4 * 24 * 3600) < 0\n",
"for": "1h",
"labels": {
"severity": "critical"
"severity": "warning"
}
},
{
"alert": "KubePersistentVolumeErrors",
"annotations": {
"message": "The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors"
"description": "The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors",
"summary": "PersistentVolume is having issues with provisioning."
},
"expr": "kube_persistentvolume_status_phase{phase=~\"Failed|Pending\",job=\"kube-state-metrics\"} > 0\n",
"for": "5m",
@ -776,10 +998,11 @@ data:
{
"alert": "KubeVersionMismatch",
"annotations": {
"message": "There are {{ $value }} different semantic versions of Kubernetes components running.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch"
"description": "There are {{ $value }} different semantic versions of Kubernetes components running.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch",
"summary": "Different semantic versions of Kubernetes components running."
},
"expr": "count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~\"kube-dns|coredns\"},\"gitVersion\",\"$1\",\"gitVersion\",\"(v[0-9]*.[0-9]*.[0-9]*).*\"))) > 1\n",
"expr": "count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~\"kube-dns|coredns\"},\"gitVersion\",\"$1\",\"gitVersion\",\"(v[0-9]*.[0-9]*).*\"))) > 1\n",
"for": "15m",
"labels": {
"severity": "warning"
@ -788,8 +1011,9 @@ data:
{
"alert": "KubeClientErrors",
"annotations": {
"message": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors"
"description": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.'",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors",
"summary": "Kubernetes API server client is experiencing errors."
},
"expr": "(sum(rate(rest_client_requests_total{code=~\"5..\"}[5m])) by (instance, job)\n /\nsum(rate(rest_client_requests_total[5m])) by (instance, job))\n> 0.01\n",
"for": "15m",
@ -800,30 +1024,66 @@ data:
]
},
{
"name": "kube-apiserver-error-alerts",
"name": "kube-apiserver-slos",
"rules": [
{
"alert": "ErrorBudgetBurn",
"alert": "KubeAPIErrorBudgetBurn",
"annotations": {
"message": "High requests error budget burn for job=apiserver (current value: {{ $value }})",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-errorbudgetburn"
"description": "The API server is burning too much error budget.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn",
"summary": "The API server is burning too much error budget."
},
"expr": "(\n status_class_5xx:apiserver_request_total:ratio_rate1h{job=\"apiserver\"} > (14.4*0.010000)\n and\n status_class_5xx:apiserver_request_total:ratio_rate5m{job=\"apiserver\"} > (14.4*0.010000)\n)\nor\n(\n status_class_5xx:apiserver_request_total:ratio_rate6h{job=\"apiserver\"} > (6*0.010000)\n and\n status_class_5xx:apiserver_request_total:ratio_rate30m{job=\"apiserver\"} > (6*0.010000)\n)\n",
"expr": "sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)\nand\nsum(apiserver_request:burnrate5m) > (14.40 * 0.01000)\n",
"for": "2m",
"labels": {
"job": "apiserver",
"severity": "critical"
"long": "1h",
"severity": "critical",
"short": "5m"
}
},
{
"alert": "ErrorBudgetBurn",
"alert": "KubeAPIErrorBudgetBurn",
"annotations": {
"message": "High requests error budget burn for job=apiserver (current value: {{ $value }})",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-errorbudgetburn"
"description": "The API server is burning too much error budget.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn",
"summary": "The API server is burning too much error budget."
},
"expr": "(\n status_class_5xx:apiserver_request_total:ratio_rate1d{job=\"apiserver\"} > (3*0.010000)\n and\n status_class_5xx:apiserver_request_total:ratio_rate2h{job=\"apiserver\"} > (3*0.010000)\n)\nor\n(\n status_class_5xx:apiserver_request_total:ratio_rate3d{job=\"apiserver\"} > (0.010000)\n and\n status_class_5xx:apiserver_request_total:ratio_rate6h{job=\"apiserver\"} > (0.010000)\n)\n",
"expr": "sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)\nand\nsum(apiserver_request:burnrate30m) > (6.00 * 0.01000)\n",
"for": "15m",
"labels": {
"job": "apiserver",
"severity": "warning"
"long": "6h",
"severity": "critical",
"short": "30m"
}
},
{
"alert": "KubeAPIErrorBudgetBurn",
"annotations": {
"description": "The API server is burning too much error budget.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn",
"summary": "The API server is burning too much error budget."
},
"expr": "sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)\nand\nsum(apiserver_request:burnrate2h) > (3.00 * 0.01000)\n",
"for": "1h",
"labels": {
"long": "1d",
"severity": "warning",
"short": "2h"
}
},
{
"alert": "KubeAPIErrorBudgetBurn",
"annotations": {
"description": "The API server is burning too much error budget.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn",
"summary": "The API server is burning too much error budget."
},
"expr": "sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)\nand\nsum(apiserver_request:burnrate6h) > (1.00 * 0.01000)\n",
"for": "3h",
"labels": {
"long": "3d",
"severity": "warning",
"short": "6h"
}
}
]
@ -831,59 +1091,12 @@ data:
{
"name": "kubernetes-system-apiserver",
"rules": [
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has an abnormal latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "(\n cluster:apiserver_request_duration_seconds:mean5m{job=\"apiserver\"}\n >\n on (verb) group_left()\n (\n avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job=\"apiserver\"} >= 0)\n +\n 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job=\"apiserver\"} >= 0)\n )\n) > on (verb) group_left()\n1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job=\"apiserver\"} >= 0)\nand on (verb,resource)\ncluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job=\"apiserver\",quantile=\"0.99\"}\n>\n1\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job=\"apiserver\",quantile=\"0.99\"} > 4\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\",code=~\"5..\"}[5m])) by (resource,subresource,verb)\n /\nsum(rate(apiserver_request_total{job=\"apiserver\"}[5m])) by (resource,subresource,verb) > 0.10\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_total{job=\"apiserver\",code=~\"5..\"}[5m])) by (resource,subresource,verb)\n /\nsum(rate(apiserver_request_total{job=\"apiserver\"}[5m])) by (resource,subresource,verb) > 0.05\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 1.0 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
"description": "A client certificate used to authenticate to the apiserver is expiring in less than 1.0 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration",
"summary": "Client certificate is about to expire."
},
"expr": "apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 3600\n",
"labels": {
@ -893,8 +1106,9 @@ data:
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 0.1 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
"description": "A client certificate used to authenticate to the apiserver is expiring in less than 0.1 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration",
"summary": "Client certificate is about to expire."
},
"expr": "apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 300\n",
"labels": {
@ -904,8 +1118,9 @@ data:
{
"alert": "AggregatedAPIErrors",
"annotations": {
"message": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors"
"description": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors",
"summary": "An aggregated API has reported errors."
},
"expr": "sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2\n",
"labels": {
@ -915,10 +1130,11 @@ data:
{
"alert": "AggregatedAPIDown",
"annotations": {
"message": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down. It has not been available at least for the past five minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapidown"
"description": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 10m.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapidown",
"summary": "An aggregated API is down."
},
"expr": "sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0\n",
"expr": "(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85\n",
"for": "5m",
"labels": {
"severity": "warning"
@ -927,8 +1143,9 @@ data:
{
"alert": "KubeAPIDown",
"annotations": {
"message": "KubeAPI has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown"
"description": "KubeAPI has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown",
"summary": "Target disappeared from Prometheus target discovery."
},
"expr": "absent(up{job=\"apiserver\"} == 1)\n",
"for": "15m",
@ -944,8 +1161,9 @@ data:
{
"alert": "KubeNodeNotReady",
"annotations": {
"message": "{{ $labels.node }} has been unready for more than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready"
"description": "{{ $labels.node }} has been unready for more than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready",
"summary": "Node is not ready."
},
"expr": "kube_node_status_condition{job=\"kube-state-metrics\",condition=\"Ready\",status=\"true\"} == 0\n",
"for": "15m",
@ -956,11 +1174,12 @@ data:
{
"alert": "KubeNodeUnreachable",
"annotations": {
"message": "{{ $labels.node }} is unreachable and some workloads may be rescheduled.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable"
"description": "{{ $labels.node }} is unreachable and some workloads may be rescheduled.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable",
"summary": "Node is unreachable."
},
"expr": "kube_node_spec_taint{job=\"kube-state-metrics\",key=\"node.kubernetes.io/unreachable\",effect=\"NoSchedule\"} == 1\n",
"for": "2m",
"expr": "(kube_node_spec_taint{job=\"kube-state-metrics\",key=\"node.kubernetes.io/unreachable\",effect=\"NoSchedule\"} unless ignoring(key,value) kube_node_spec_taint{job=\"kube-state-metrics\",key=~\"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn\"}) == 1\n",
"for": "15m",
"labels": {
"severity": "warning"
}
@ -968,10 +1187,11 @@ data:
{
"alert": "KubeletTooManyPods",
"annotations": {
"message": "Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods"
"description": "Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods",
"summary": "Kubelet is running at capacity."
},
"expr": "max(max(kubelet_running_pod_count{job=\"kubelet\"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\"}) by(node) / max(kube_node_status_capacity_pods{job=\"kube-state-metrics\"} != 1) by(node) > 0.95\n",
"expr": "count by(node) (\n (kube_pod_status_phase{job=\"kube-state-metrics\",phase=\"Running\"} == 1) * on(instance,pod,namespace,cluster) group_left(node) topk by(instance,pod,namespace,cluster) (1, kube_pod_info{job=\"kube-state-metrics\"})\n)\n/\nmax by(node) (\n kube_node_status_capacity_pods{job=\"kube-state-metrics\"} != 1\n) > 0.95\n",
"for": "15m",
"labels": {
"severity": "warning"
@ -980,8 +1200,9 @@ data:
{
"alert": "KubeNodeReadinessFlapping",
"annotations": {
"message": "The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping"
"description": "The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping",
"summary": "Node readiness status is flapping."
},
"expr": "sum(changes(kube_node_status_condition{status=\"true\",condition=\"Ready\"}[15m])) by (node) > 2\n",
"for": "15m",
@ -992,8 +1213,9 @@ data:
{
"alert": "KubeletPlegDurationHigh",
"annotations": {
"message": "The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh"
"description": "The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh",
"summary": "Kubelet Pod Lifecycle Event Generator is taking too long to relist."
},
"expr": "node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile=\"0.99\"} >= 10\n",
"for": "5m",
@ -1004,10 +1226,85 @@ data:
{
"alert": "KubeletPodStartUpLatencyHigh",
"annotations": {
"message": "Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh"
"description": "Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh",
"summary": "Kubelet Pod startup latency is too high."
},
"expr": "histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name > 60\n",
"expr": "histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job=\"kubelet\"} > 60\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletClientCertificateExpiration",
"annotations": {
"description": "Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration",
"summary": "Kubelet client certificate is about to expire."
},
"expr": "kubelet_certificate_manager_client_ttl_seconds < 3600\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletClientCertificateExpiration",
"annotations": {
"description": "Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration",
"summary": "Kubelet client certificate is about to expire."
},
"expr": "kubelet_certificate_manager_client_ttl_seconds < 300\n",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeletServerCertificateExpiration",
"annotations": {
"description": "Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration",
"summary": "Kubelet server certificate is about to expire."
},
"expr": "kubelet_certificate_manager_server_ttl_seconds < 3600\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletServerCertificateExpiration",
"annotations": {
"description": "Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration",
"summary": "Kubelet server certificate is about to expire."
},
"expr": "kubelet_certificate_manager_server_ttl_seconds < 300\n",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeletClientCertificateRenewalErrors",
"annotations": {
"description": "Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes).",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificaterenewalerrors",
"summary": "Kubelet has failed to renew its client certificate."
},
"expr": "increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletServerCertificateRenewalErrors",
"annotations": {
"description": "Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificaterenewalerrors",
"summary": "Kubelet has failed to renew its server certificate."
},
"expr": "increase(kubelet_server_expiration_renew_errors[5m]) > 0\n",
"for": "15m",
"labels": {
"severity": "warning"
@ -1016,8 +1313,9 @@ data:
{
"alert": "KubeletDown",
"annotations": {
"message": "Kubelet has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown"
"description": "Kubelet has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown",
"summary": "Target disappeared from Prometheus target discovery."
},
"expr": "absent(up{job=\"kubelet\"} == 1)\n",
"for": "15m",
@ -1033,8 +1331,9 @@ data:
{
"alert": "KubeSchedulerDown",
"annotations": {
"message": "KubeScheduler has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown"
"description": "KubeScheduler has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown",
"summary": "Target disappeared from Prometheus target discovery."
},
"expr": "absent(up{job=\"kube-scheduler\"} == 1)\n",
"for": "15m",
@ -1050,8 +1349,9 @@ data:
{
"alert": "KubeControllerManagerDown",
"annotations": {
"message": "KubeControllerManager has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown"
"description": "KubeControllerManager has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown",
"summary": "Target disappeared from Prometheus target discovery."
},
"expr": "absent(up{job=\"kube-controller-manager\"} == 1)\n",
"for": "15m",
@ -1350,14 +1650,25 @@ data:
{
"alert": "NodeHighNumberConntrackEntriesUsed",
"annotations": {
"description": "{{ $value | humanizePercentage }} of conntrack entries are used",
"summary": "Number of conntrack are getting close to the limit"
"description": "{{ $value | humanizePercentage }} of conntrack entries are used.",
"summary": "Number of conntrack are getting close to the limit."
},
"expr": "(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "NodeTextFileCollectorScrapeError",
"annotations": {
"description": "Node Exporter text file collector failed to scrape.",
"summary": "Node Exporter text file collector failed to scrape."
},
"expr": "node_textfile_scrape_error{job=\"node-exporter\"} == 1\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "NodeClockSkewDetected",
"annotations": {
@ -1381,6 +1692,29 @@ data:
"labels": {
"severity": "warning"
}
},
{
"alert": "NodeRAIDDegraded",
"annotations": {
"description": "RAID array '{{ $labels.device }}' on {{ $labels.instance }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.",
"summary": "RAID Array is degraded"
},
"expr": "node_md_disks_required - ignoring (state) (node_md_disks{state=\"active\"}) > 0\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "NodeRAIDDiskFailure",
"annotations": {
"description": "At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap.",
"summary": "Failed device in RAID array"
},
"expr": "node_md_disks{state=\"fail\"} > 0\n",
"labels": {
"severity": "warning"
}
}
]
}
@ -1515,7 +1849,7 @@ data:
{
"alert": "PrometheusRemoteStorageFailures",
"annotations": {
"description": "Prometheus {{$labels.instance}} failed to send {{ printf \"%.1f\" $value }}% of the samples to {{ if $labels.queue }}{{ $labels.queue }}{{ else }}{{ $labels.url }}{{ end }}.",
"description": "Prometheus {{$labels.instance}} failed to send {{ printf \"%.1f\" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }}",
"summary": "Prometheus fails to send samples to remote storage."
},
"expr": "(\n rate(prometheus_remote_storage_failed_samples_total{job=\"prometheus\"}[5m])\n/\n (\n rate(prometheus_remote_storage_failed_samples_total{job=\"prometheus\"}[5m])\n +\n rate(prometheus_remote_storage_succeeded_samples_total{job=\"prometheus\"}[5m])\n )\n)\n* 100\n> 1\n",
@ -1527,7 +1861,7 @@ data:
{
"alert": "PrometheusRemoteWriteBehind",
"annotations": {
"description": "Prometheus {{$labels.instance}} remote write is {{ printf \"%.1f\" $value }}s behind for {{ if $labels.queue }}{{ $labels.queue }}{{ else }}{{ $labels.url }}{{ end }}.",
"description": "Prometheus {{$labels.instance}} remote write is {{ printf \"%.1f\" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}.",
"summary": "Prometheus remote write is behind."
},
"expr": "# Without max_over_time, failed scrapes could create false negatives, see\n# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.\n(\n max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job=\"prometheus\"}[5m])\n- on(job, instance) group_right\n max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job=\"prometheus\"}[5m])\n)\n> 120\n",
@ -1539,7 +1873,7 @@ data:
{
"alert": "PrometheusRemoteWriteDesiredShards",
"annotations": {
"description": "Prometheus {{$labels.instance}} remote write desired shards calculation wants to run {{ $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance=\"%s\",job=\"prometheus\"}` $labels.instance | query | first | value }}.",
"description": "Prometheus {{$labels.instance}} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name}}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance=\"%s\",job=\"prometheus\"}` $labels.instance | query | first | value }}.",
"summary": "Prometheus remote write desired shards calculation wants to run more than configured max shards."
},
"expr": "# Without max_over_time, failed scrapes could create false negatives, see\n# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.\n(\n max_over_time(prometheus_remote_storage_shards_desired{job=\"prometheus\"}[5m])\n>\n max_over_time(prometheus_remote_storage_shards_max{job=\"prometheus\"}[5m])\n)\n",
@ -1571,6 +1905,18 @@ data:
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTargetLimitHit",
"annotations": {
"description": "Prometheus {{$labels.instance}} has dropped {{ printf \"%.0f\" $value }} targets because the number of targets exceeded the configured target_limit.",
"summary": "Prometheus has dropped targets because some scrape configs have exceeded the targets limit."
},
"expr": "increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job=\"prometheus\"}[5m]) > 0\n",
"for": "15m",
"labels": {
"severity": "warning"
}
}
]
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [spot](https://typhoon.psdn.io/cl/aws/#spot) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]

View File

@ -2,12 +2,12 @@
systemd:
units:
- name: etcd-member.service
enable: true
enabled: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.4.9"
Environment="ETCD_IMAGE_TAG=v3.4.12"
Environment="ETCD_IMAGE_URL=docker://quay.io/coreos/etcd"
Environment="RKT_RUN_ARGS=--insecure-options=image"
Environment="ETCD_NAME=${etcd_name}"
@ -28,11 +28,11 @@ systemd:
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -46,13 +46,13 @@ systemd:
RequiredBy=kubelet.service
RequiredBy=etcd-member.service
- name: kubelet.service
enable: true
enabled: true
contents: |
[Unit]
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -102,16 +102,13 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -137,7 +134,7 @@ systemd:
--volume script,kind=host,source=/opt/bootstrap/apply \
--mount volume=script,target=/apply \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/apply
@ -145,6 +142,11 @@ systemd:
[Install]
WantedBy=multi-user.target
storage:
directories:
- path: /var/lib/etcd
filesystem: root
mode: 0700
overwrite: true
files:
- path: /etc/kubernetes/kubeconfig
filesystem: root
@ -166,6 +168,7 @@ storage:
mv tls/etcd/etcd-client* /etc/kubernetes/bootstrap-secrets/
chown -R etcd:etcd /etc/ssl/etcd
chmod -R 500 /etc/ssl/etcd
chmod -R 700 /var/lib/etcd
mv auth/kubeconfig /etc/kubernetes/bootstrap-secrets/
mv tls/k8s/* /etc/kubernetes/bootstrap-secrets/
mkdir -p /etc/kubernetes/manifests
@ -191,6 +194,7 @@ storage:
done
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184

View File

@ -49,10 +49,10 @@ resource "aws_instance" "controllers" {
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
pretty_print = false
snippets = var.controller_snippets
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
strict = true
snippets = var.controller_snippets
}
# Controller Container Linux configs

View File

@ -13,6 +13,30 @@ resource "aws_security_group" "controller" {
}
}
resource "aws_security_group_rule" "controller-icmp" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-icmp-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
self = true
}
resource "aws_security_group_rule" "controller-ssh" {
security_group_id = aws_security_group.controller.id
@ -44,39 +68,31 @@ resource "aws_security_group_rule" "controller-etcd-metrics" {
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-proxy
resource "aws_security_group_rule" "kube-proxy-metrics" {
resource "aws_security_group_rule" "controller-cilium-health" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10249
to_port = 10249
from_port = 4240
to_port = 4240
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-scheduler
resource "aws_security_group_rule" "controller-scheduler-metrics" {
resource "aws_security_group_rule" "controller-cilium-health-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10251
to_port = 10251
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-controller-manager
resource "aws_security_group_rule" "controller-manager-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10252
to_port = 10252
source_security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
self = true
}
# IANA VXLAN default
resource "aws_security_group_rule" "controller-vxlan" {
count = var.networking == "flannel" ? 1 : 0
@ -111,6 +127,31 @@ resource "aws_security_group_rule" "controller-apiserver" {
cidr_blocks = ["0.0.0.0/0"]
}
# Linux VXLAN default
resource "aws_security_group_rule" "controller-linux-vxlan" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-linux-vxlan-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
# Allow Prometheus to scrape node-exporter daemonset
resource "aws_security_group_rule" "controller-node-exporter" {
security_group_id = aws_security_group.controller.id
@ -122,6 +163,17 @@ resource "aws_security_group_rule" "controller-node-exporter" {
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-proxy
resource "aws_security_group_rule" "kube-proxy-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10249
to_port = 10249
source_security_group_id = aws_security_group.worker.id
}
# Allow apiserver to access kubelets for exec, log, port-forward
resource "aws_security_group_rule" "controller-kubelet" {
security_group_id = aws_security_group.controller.id
@ -143,6 +195,28 @@ resource "aws_security_group_rule" "controller-kubelet-self" {
self = true
}
# Allow Prometheus to scrape kube-scheduler
resource "aws_security_group_rule" "controller-scheduler-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10251
to_port = 10251
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-controller-manager
resource "aws_security_group_rule" "controller-manager-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10252
to_port = 10252
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-bgp" {
security_group_id = aws_security_group.controller.id
@ -227,6 +301,30 @@ resource "aws_security_group" "worker" {
}
}
resource "aws_security_group_rule" "worker-icmp" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-icmp-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
self = true
}
resource "aws_security_group_rule" "worker-ssh" {
security_group_id = aws_security_group.worker.id
@ -257,6 +355,31 @@ resource "aws_security_group_rule" "worker-https" {
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-cilium-health" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-cilium-health-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
self = true
}
# IANA VXLAN default
resource "aws_security_group_rule" "worker-vxlan" {
count = var.networking == "flannel" ? 1 : 0
@ -281,6 +404,31 @@ resource "aws_security_group_rule" "worker-vxlan-self" {
self = true
}
# Linux VXLAN default
resource "aws_security_group_rule" "worker-linux-vxlan" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-linux-vxlan-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
# Allow Prometheus to scrape node-exporter daemonset
resource "aws_security_group_rule" "worker-node-exporter" {
security_group_id = aws_security_group.worker.id

View File

@ -1,11 +1,15 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
aws = "~> 2.23"
ct = "~> 0.3"
aws = ">= 2.23, <= 4.0"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -2,11 +2,11 @@
systemd:
units:
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -19,13 +19,13 @@ systemd:
[Install]
RequiredBy=kubelet.service
- name: kubelet.service
enable: true
enabled: true
contents: |
[Unit]
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -75,10 +75,8 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in split(",", node_labels) ~}
@ -115,6 +113,7 @@ storage:
${kubeconfig}
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184
@ -130,7 +129,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/usr/local/bin/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -1,4 +1,14 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.12"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
aws = ">= 2.23, <= 4.0"
template = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -71,9 +71,9 @@ resource "aws_launch_configuration" "worker" {
# Worker Ignition config
data "ct_config" "worker-ignition" {
content = data.template_file.worker-config.rendered
pretty_print = false
snippets = var.snippets
content = data.template_file.worker-config.rendered
strict = true
snippets = var.snippets
}
# Worker Container Linux config

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/), SELinux enforcing
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [spot](https://typhoon.psdn.io/cl/aws/#spot) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: etcd-member.service
@ -28,7 +28,7 @@ systemd:
--network host \
--volume /var/lib/etcd:/var/lib/etcd:rw,Z \
--volume /etc/ssl/etcd:/etc/ssl/certs:ro,Z \
quay.io/coreos/etcd:v3.4.9
quay.io/coreos/etcd:v3.4.12
ExecStop=/usr/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
@ -38,11 +38,12 @@ systemd:
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
Description=Wait for DNS and hostname
Before=kubelet.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/bin/sh -c 'while [ `hostname -s` == "localhost" ]; do sleep 1; done;'
ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
[Install]
RequiredBy=kubelet.service
@ -54,7 +55,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -92,16 +93,13 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/podman stop kubelet
@ -126,11 +124,13 @@ systemd:
--volume /opt/bootstrap/assets:/assets:ro,Z \
--volume /opt/bootstrap/apply:/apply:ro,Z \
--entrypoint=/apply \
quay.io/poseidon/kubelet:v1.18.3
quay.io/poseidon/kubelet:v1.19.2
ExecStartPost=/bin/touch /opt/bootstrap/bootstrap.done
ExecStartPost=-/usr/bin/podman stop bootstrap
storage:
directories:
- path: /var/lib/etcd
mode: 0700
- path: /etc/kubernetes
- path: /opt/bootstrap
files:
@ -160,6 +160,7 @@ storage:
mv manifests /opt/bootstrap/assets/manifests
mv manifests-networking/* /opt/bootstrap/assets/manifests/
rm -rf assets auth static-manifests tls manifests-networking
chcon -R -u system_u -t container_file_t /etc/kubernetes/bootstrap-secrets
- path: /opt/bootstrap/apply
mode: 0544
contents:
@ -178,6 +179,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -13,6 +13,30 @@ resource "aws_security_group" "controller" {
}
}
resource "aws_security_group_rule" "controller-icmp" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-icmp-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
self = true
}
resource "aws_security_group_rule" "controller-ssh" {
security_group_id = aws_security_group.controller.id
@ -44,39 +68,31 @@ resource "aws_security_group_rule" "controller-etcd-metrics" {
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-proxy
resource "aws_security_group_rule" "kube-proxy-metrics" {
resource "aws_security_group_rule" "controller-cilium-health" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10249
to_port = 10249
from_port = 4240
to_port = 4240
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-scheduler
resource "aws_security_group_rule" "controller-scheduler-metrics" {
resource "aws_security_group_rule" "controller-cilium-health-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10251
to_port = 10251
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-controller-manager
resource "aws_security_group_rule" "controller-manager-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10252
to_port = 10252
source_security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
self = true
}
# IANA VXLAN default
resource "aws_security_group_rule" "controller-vxlan" {
count = var.networking == "flannel" ? 1 : 0
@ -111,6 +127,31 @@ resource "aws_security_group_rule" "controller-apiserver" {
cidr_blocks = ["0.0.0.0/0"]
}
# Linux VXLAN default
resource "aws_security_group_rule" "controller-linux-vxlan" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-linux-vxlan-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
# Allow Prometheus to scrape node-exporter daemonset
resource "aws_security_group_rule" "controller-node-exporter" {
security_group_id = aws_security_group.controller.id
@ -122,6 +163,17 @@ resource "aws_security_group_rule" "controller-node-exporter" {
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-proxy
resource "aws_security_group_rule" "kube-proxy-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10249
to_port = 10249
source_security_group_id = aws_security_group.worker.id
}
# Allow apiserver to access kubelets for exec, log, port-forward
resource "aws_security_group_rule" "controller-kubelet" {
security_group_id = aws_security_group.controller.id
@ -143,6 +195,28 @@ resource "aws_security_group_rule" "controller-kubelet-self" {
self = true
}
# Allow Prometheus to scrape kube-scheduler
resource "aws_security_group_rule" "controller-scheduler-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10251
to_port = 10251
source_security_group_id = aws_security_group.worker.id
}
# Allow Prometheus to scrape kube-controller-manager
resource "aws_security_group_rule" "controller-manager-metrics" {
security_group_id = aws_security_group.controller.id
type = "ingress"
protocol = "tcp"
from_port = 10252
to_port = 10252
source_security_group_id = aws_security_group.worker.id
}
resource "aws_security_group_rule" "controller-bgp" {
security_group_id = aws_security_group.controller.id
@ -227,6 +301,30 @@ resource "aws_security_group" "worker" {
}
}
resource "aws_security_group_rule" "worker-icmp" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-icmp-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "icmp"
from_port = 8
to_port = 0
self = true
}
resource "aws_security_group_rule" "worker-ssh" {
security_group_id = aws_security_group.worker.id
@ -257,6 +355,31 @@ resource "aws_security_group_rule" "worker-https" {
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "worker-cilium-health" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-cilium-health-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "tcp"
from_port = 4240
to_port = 4240
self = true
}
# IANA VXLAN default
resource "aws_security_group_rule" "worker-vxlan" {
count = var.networking == "flannel" ? 1 : 0
@ -281,6 +404,31 @@ resource "aws_security_group_rule" "worker-vxlan-self" {
self = true
}
# Linux VXLAN default
resource "aws_security_group_rule" "worker-linux-vxlan" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
source_security_group_id = aws_security_group.controller.id
}
resource "aws_security_group_rule" "worker-linux-vxlan-self" {
count = var.networking == "cilium" ? 1 : 0
security_group_id = aws_security_group.worker.id
type = "ingress"
protocol = "udp"
from_port = 8472
to_port = 8472
self = true
}
# Allow Prometheus to scrape node-exporter daemonset
resource "aws_security_group_rule" "worker-node-exporter" {
security_group_id = aws_security_group.worker.id

View File

@ -1,11 +1,15 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
aws = "~> 2.23"
ct = "~> 0.4"
aws = ">= 2.23, <= 4.0"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: docker.service
@ -9,11 +9,12 @@ systemd:
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
Description=Wait for DNS and hostname
Before=kubelet.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/bin/sh -c 'while [ `hostname -s` == "localhost" ]; do sleep 1; done;'
ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
[Install]
RequiredBy=kubelet.service
@ -24,7 +25,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -62,10 +63,8 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in split(",", node_labels) ~}
@ -90,7 +89,7 @@ systemd:
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/true
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.18.3 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.19.2 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
[Install]
WantedBy=multi-user.target
storage:
@ -106,6 +105,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,4 +1,14 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.12"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
aws = ">= 2.23, <= 4.0"
template = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [low-priority](https://typhoon.psdn.io/cl/azure/#low-priority) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]

View File

@ -2,12 +2,12 @@
systemd:
units:
- name: etcd-member.service
enable: true
enabled: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.4.9"
Environment="ETCD_IMAGE_TAG=v3.4.12"
Environment="ETCD_IMAGE_URL=docker://quay.io/coreos/etcd"
Environment="RKT_RUN_ARGS=--insecure-options=image"
Environment="ETCD_NAME=${etcd_name}"
@ -28,11 +28,11 @@ systemd:
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -46,13 +46,14 @@ systemd:
RequiredBy=kubelet.service
RequiredBy=etcd-member.service
- name: kubelet.service
enable: true
enabled: true
contents: |
[Unit]
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -96,20 +97,18 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
--cgroup-driver=$${KUBELET_CGROUP_DRIVER} \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -135,7 +134,7 @@ systemd:
--volume script,kind=host,source=/opt/bootstrap/apply \
--mount volume=script,target=/apply \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/apply
@ -143,6 +142,11 @@ systemd:
[Install]
WantedBy=multi-user.target
storage:
directories:
- path: /var/lib/etcd
filesystem: root
mode: 0700
overwrite: true
files:
- path: /etc/kubernetes/kubeconfig
filesystem: root
@ -164,6 +168,7 @@ storage:
mv tls/etcd/etcd-client* /etc/kubernetes/bootstrap-secrets/
chown -R etcd:etcd /etc/ssl/etcd
chmod -R 500 /etc/ssl/etcd
chmod -R 700 /var/lib/etcd
mv auth/kubeconfig /etc/kubernetes/bootstrap-secrets/
mv tls/k8s/* /etc/kubernetes/bootstrap-secrets/
mkdir -p /etc/kubernetes/manifests
@ -189,6 +194,7 @@ storage:
done
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184

View File

@ -139,10 +139,10 @@ resource "azurerm_network_interface_backend_address_pool_association" "controlle
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
pretty_print = false
snippets = var.controller_snippets
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
strict = true
snippets = var.controller_snippets
}
# Controller Container Linux configs
@ -157,6 +157,7 @@ data "template_file" "controller-configs" {
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = join(",", data.template_file.etcds.*.rendered)
cgroup_driver = local.flavor == "flatcar" && local.channel == "edge" ? "systemd" : "cgroupfs"
kubeconfig = indent(10, module.bootstrap.kubeconfig-kubelet)
ssh_authorized_key = var.ssh_authorized_key
cluster_dns_service_ip = cidrhost(var.service_cidr, 10)

View File

@ -7,6 +7,21 @@ resource "azurerm_network_security_group" "controller" {
location = azurerm_resource_group.cluster.location
}
resource "azurerm_network_security_rule" "controller-icmp" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-icmp"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "1995"
access = "Allow"
direction = "Inbound"
protocol = "Icmp"
source_port_range = "*"
destination_port_range = "*"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-ssh" {
resource_group_name = azurerm_resource_group.cluster.name
@ -100,6 +115,22 @@ resource "azurerm_network_security_rule" "controller-apiserver" {
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-cilium-health" {
resource_group_name = azurerm_resource_group.cluster.name
count = var.networking == "cilium" ? 1 : 0
name = "allow-cilium-health"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "2019"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "4240"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
@ -115,6 +146,21 @@ resource "azurerm_network_security_rule" "controller-vxlan" {
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-linux-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-linux-vxlan"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "2021"
access = "Allow"
direction = "Inbound"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "8472"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
# Allow Prometheus to scrape node-exporter daemonset
resource "azurerm_network_security_rule" "controller-node-exporter" {
resource_group_name = azurerm_resource_group.cluster.name
@ -191,6 +237,21 @@ resource "azurerm_network_security_group" "worker" {
location = azurerm_resource_group.cluster.location
}
resource "azurerm_network_security_rule" "worker-icmp" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-icmp"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "1995"
access = "Allow"
direction = "Inbound"
protocol = "Icmp"
source_port_range = "*"
destination_port_range = "*"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-ssh" {
resource_group_name = azurerm_resource_group.cluster.name
@ -236,6 +297,22 @@ resource "azurerm_network_security_rule" "worker-https" {
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-cilium-health" {
resource_group_name = azurerm_resource_group.cluster.name
count = var.networking == "cilium" ? 1 : 0
name = "allow-cilium-health"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "2014"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "4240"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
@ -251,6 +328,21 @@ resource "azurerm_network_security_rule" "worker-vxlan" {
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-linux-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-linux-vxlan"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "2016"
access = "Allow"
direction = "Inbound"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "8472"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
# Allow Prometheus to scrape node-exporter daemonset
resource "azurerm_network_security_rule" "worker-node-exporter" {
resource_group_name = azurerm_resource_group.cluster.name

View File

@ -1,12 +1,16 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
azurerm = "~> 2.8"
ct = "~> 0.3"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -2,11 +2,11 @@
systemd:
units:
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -19,13 +19,14 @@ systemd:
[Install]
RequiredBy=kubelet.service
- name: kubelet.service
enable: true
enabled: true
contents: |
[Unit]
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -69,14 +70,13 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
--cgroup-driver=$${KUBELET_CGROUP_DRIVER} \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in split(",", node_labels) ~}
@ -92,7 +92,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: delete-node.service
enable: true
enabled: true
contents: |
[Unit]
Description=Waiting to delete Kubernetes node on shutdown
@ -113,6 +113,7 @@ storage:
${kubeconfig}
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184
@ -128,7 +129,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/usr/local/bin/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname | tr '[:upper:]' '[:lower:]')

View File

@ -1,4 +1,14 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.12"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
azurerm = "~> 2.8"
template = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -97,9 +97,9 @@ resource "azurerm_monitor_autoscale_setting" "workers" {
# Worker Ignition configs
data "ct_config" "worker-ignition" {
content = data.template_file.worker-config.rendered
pretty_print = false
snippets = var.snippets
content = data.template_file.worker-config.rendered
strict = true
snippets = var.snippets
}
# Worker Container Linux configs
@ -111,6 +111,7 @@ data "template_file" "worker-config" {
ssh_authorized_key = var.ssh_authorized_key
cluster_dns_service_ip = cidrhost(var.service_cidr, 10)
cluster_domain_suffix = var.cluster_domain_suffix
cgroup_driver = local.flavor == "flatcar" && local.channel == "edge" ? "systemd" : "cgroupfs"
node_labels = join(",", var.node_labels)
}
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/), SELinux enforcing
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [spot priority](https://typhoon.psdn.io/fedora-coreos/azure/#low-priority) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]
@ -10,8 +10,9 @@ module "bootstrap" {
networking = var.networking
# only effective with Calico networking
# we should be able to use 1450 MTU, but in practice, 1410 was needed
network_encapsulation = "vxlan"
network_mtu = "1450"
network_mtu = "1410"
pod_cidr = var.pod_cidr
service_cidr = var.service_cidr

View File

@ -113,10 +113,10 @@ resource "azurerm_network_interface_backend_address_pool_association" "controlle
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
pretty_print = false
snippets = var.controller_snippets
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
strict = true
snippets = var.controller_snippets
}
# Controller Fedora CoreOS configs

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: etcd-member.service
@ -28,7 +28,7 @@ systemd:
--network host \
--volume /var/lib/etcd:/var/lib/etcd:rw,Z \
--volume /etc/ssl/etcd:/etc/ssl/certs:ro,Z \
quay.io/coreos/etcd:v3.4.9
quay.io/coreos/etcd:v3.4.12
ExecStop=/usr/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
@ -54,7 +54,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -92,16 +92,13 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/podman stop kubelet
@ -126,11 +123,13 @@ systemd:
--volume /opt/bootstrap/assets:/assets:ro,Z \
--volume /opt/bootstrap/apply:/apply:ro,Z \
--entrypoint=/apply \
quay.io/poseidon/kubelet:v1.18.3
quay.io/poseidon/kubelet:v1.19.2
ExecStartPost=/bin/touch /opt/bootstrap/bootstrap.done
ExecStartPost=-/usr/bin/podman stop bootstrap
storage:
directories:
- path: /var/lib/etcd
mode: 0700
- path: /etc/kubernetes
- path: /opt/bootstrap
files:
@ -160,6 +159,7 @@ storage:
mv manifests /opt/bootstrap/assets/manifests
mv manifests-networking/* /opt/bootstrap/assets/manifests/
rm -rf assets auth static-manifests tls manifests-networking
chcon -R -u system_u -t container_file_t /etc/kubernetes/bootstrap-secrets
- path: /opt/bootstrap/apply
mode: 0544
contents:
@ -178,6 +178,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -7,6 +7,21 @@ resource "azurerm_network_security_group" "controller" {
location = azurerm_resource_group.cluster.location
}
resource "azurerm_network_security_rule" "controller-icmp" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-icmp"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "1995"
access = "Allow"
direction = "Inbound"
protocol = "Icmp"
source_port_range = "*"
destination_port_range = "*"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-ssh" {
resource_group_name = azurerm_resource_group.cluster.name
@ -100,6 +115,22 @@ resource "azurerm_network_security_rule" "controller-apiserver" {
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-cilium-health" {
resource_group_name = azurerm_resource_group.cluster.name
count = var.networking == "cilium" ? 1 : 0
name = "allow-cilium-health"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "2019"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "4240"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
@ -115,6 +146,21 @@ resource "azurerm_network_security_rule" "controller-vxlan" {
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
resource "azurerm_network_security_rule" "controller-linux-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-linux-vxlan"
network_security_group_name = azurerm_network_security_group.controller.name
priority = "2021"
access = "Allow"
direction = "Inbound"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "8472"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.controller.address_prefix
}
# Allow Prometheus to scrape node-exporter daemonset
resource "azurerm_network_security_rule" "controller-node-exporter" {
resource_group_name = azurerm_resource_group.cluster.name
@ -191,6 +237,21 @@ resource "azurerm_network_security_group" "worker" {
location = azurerm_resource_group.cluster.location
}
resource "azurerm_network_security_rule" "worker-icmp" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-icmp"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "1995"
access = "Allow"
direction = "Inbound"
protocol = "Icmp"
source_port_range = "*"
destination_port_range = "*"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-ssh" {
resource_group_name = azurerm_resource_group.cluster.name
@ -236,6 +297,22 @@ resource "azurerm_network_security_rule" "worker-https" {
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-cilium-health" {
resource_group_name = azurerm_resource_group.cluster.name
count = var.networking == "cilium" ? 1 : 0
name = "allow-cilium-health"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "2014"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "4240"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
@ -251,6 +328,21 @@ resource "azurerm_network_security_rule" "worker-vxlan" {
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
resource "azurerm_network_security_rule" "worker-linux-vxlan" {
resource_group_name = azurerm_resource_group.cluster.name
name = "allow-linux-vxlan"
network_security_group_name = azurerm_network_security_group.worker.name
priority = "2016"
access = "Allow"
direction = "Inbound"
protocol = "Udp"
source_port_range = "*"
destination_port_range = "8472"
source_address_prefixes = [azurerm_subnet.controller.address_prefix, azurerm_subnet.worker.address_prefix]
destination_address_prefix = azurerm_subnet.worker.address_prefix
}
# Allow Prometheus to scrape node-exporter daemonset
resource "azurerm_network_security_rule" "worker-node-exporter" {
resource_group_name = azurerm_resource_group.cluster.name

View File

@ -1,12 +1,16 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
azurerm = "~> 2.8"
ct = "~> 0.3"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: docker.service
@ -24,7 +24,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -62,10 +62,8 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in split(",", node_labels) ~}
@ -90,7 +88,7 @@ systemd:
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/true
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.18.3 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.19.2 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
[Install]
WantedBy=multi-user.target
storage:
@ -106,6 +104,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,4 +1,14 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.12"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
azurerm = "~> 2.8"
template = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
}
}

View File

@ -72,9 +72,9 @@ resource "azurerm_monitor_autoscale_setting" "workers" {
# Worker Ignition configs
data "ct_config" "worker-ignition" {
content = data.template_file.worker-config.rendered
pretty_print = false
snippets = var.snippets
content = data.template_file.worker-config.rendered
strict = true
snippets = var.snippets
}
# Worker Fedora CoreOS configs

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [var.k8s_domain_name]

View File

@ -2,12 +2,12 @@
systemd:
units:
- name: etcd-member.service
enable: true
enabled: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.4.9"
Environment="ETCD_IMAGE_TAG=v3.4.12"
Environment="ETCD_IMAGE_URL=docker://quay.io/coreos/etcd"
Environment="RKT_RUN_ARGS=--insecure-options=image"
Environment="ETCD_NAME=${etcd_name}"
@ -28,11 +28,11 @@ systemd:
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: kubelet.path
enable: true
enabled: true
contents: |
[Unit]
Description=Watch for kubeconfig
@ -41,7 +41,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -60,7 +60,7 @@ systemd:
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -114,17 +114,14 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=${domain_name} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -150,7 +147,7 @@ systemd:
--volume script,kind=host,source=/opt/bootstrap/apply \
--mount volume=script,target=/apply \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/apply
@ -159,8 +156,13 @@ systemd:
WantedBy=multi-user.target
storage:
directories:
- path: /var/lib/etcd
filesystem: root
mode: 0700
overwrite: true
- path: /etc/kubernetes
filesystem: root
mode: 0755
files:
- path: /etc/hostname
filesystem: root
@ -182,6 +184,7 @@ storage:
mv tls/etcd/etcd-client* /etc/kubernetes/bootstrap-secrets/
chown -R etcd:etcd /etc/ssl/etcd
chmod -R 500 /etc/ssl/etcd
chmod -R 700 /var/lib/etcd
mv auth/kubeconfig /etc/kubernetes/bootstrap-secrets/
mv tls/k8s/* /etc/kubernetes/bootstrap-secrets/
mkdir -p /etc/kubernetes/manifests
@ -207,6 +210,7 @@ storage:
done
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184

View File

@ -2,7 +2,7 @@
systemd:
units:
- name: installer.service
enable: true
enabled: true
contents: |
[Unit]
Requires=network-online.target

View File

@ -2,11 +2,11 @@
systemd:
units:
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: kubelet.path
enable: true
enabled: true
contents: |
[Unit]
Description=Watch for kubeconfig
@ -15,7 +15,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -33,7 +33,7 @@ systemd:
Description=Kubelet
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
Environment=KUBELET_CGROUP_DRIVER=${cgroup_driver}
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -87,11 +87,9 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=${domain_name} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in compact(split(",", node_labels)) ~}
@ -114,6 +112,7 @@ storage:
directories:
- path: /etc/kubernetes
filesystem: root
mode: 0755
files:
- path: /etc/hostname
filesystem: root
@ -123,6 +122,7 @@ storage:
${domain_name}
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184

View File

@ -141,10 +141,10 @@ resource "matchbox_profile" "controllers" {
}
data "ct_config" "controller-ignitions" {
count = length(var.controllers)
content = data.template_file.controller-configs.*.rendered[count.index]
pretty_print = false
snippets = lookup(var.snippets, var.controllers.*.name[count.index], [])
count = length(var.controllers)
content = data.template_file.controller-configs.*.rendered[count.index]
strict = true
snippets = lookup(var.snippets, var.controllers.*.name[count.index], [])
}
data "template_file" "controller-configs" {
@ -171,10 +171,10 @@ resource "matchbox_profile" "workers" {
}
data "ct_config" "worker-ignitions" {
count = length(var.workers)
content = data.template_file.worker-configs.*.rendered[count.index]
pretty_print = false
snippets = lookup(var.snippets, var.workers.*.name[count.index], [])
count = length(var.workers)
content = data.template_file.worker-configs.*.rendered[count.index]
strict = true
snippets = lookup(var.snippets, var.workers.*.name[count.index], [])
}
data "template_file" "worker-configs" {

View File

@ -1,12 +1,20 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
matchbox = "~> 0.3.0"
ct = "~> 0.3"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
matchbox = {
source = "poseidon/matchbox"
version = "~> 0.4.1"
}
}
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/), SELinux enforcing
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [var.k8s_domain_name]

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: etcd-member.service
@ -28,7 +28,7 @@ systemd:
--network host \
--volume /var/lib/etcd:/var/lib/etcd:rw,Z \
--volume /etc/ssl/etcd:/etc/ssl/certs:ro,Z \
quay.io/coreos/etcd:v3.4.9
quay.io/coreos/etcd:v3.4.12
ExecStop=/usr/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
@ -53,7 +53,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -93,17 +93,14 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=${domain_name} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/podman stop kubelet
@ -137,11 +134,13 @@ systemd:
--volume /opt/bootstrap/assets:/assets:ro,Z \
--volume /opt/bootstrap/apply:/apply:ro,Z \
--entrypoint=/apply \
quay.io/poseidon/kubelet:v1.18.3
quay.io/poseidon/kubelet:v1.19.2
ExecStartPost=/bin/touch /opt/bootstrap/bootstrap.done
ExecStartPost=-/usr/bin/podman stop bootstrap
storage:
directories:
- path: /var/lib/etcd
mode: 0700
- path: /etc/kubernetes
- path: /opt/bootstrap
files:
@ -171,6 +170,7 @@ storage:
mv manifests /opt/bootstrap/assets/manifests
mv manifests-networking/* /opt/bootstrap/assets/manifests/
rm -rf assets auth static-manifests tls manifests-networking
chcon -R -u system_u -t container_file_t /etc/kubernetes/bootstrap-secrets
- path: /opt/bootstrap/apply
mode: 0544
contents:
@ -189,6 +189,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: docker.service
@ -23,7 +23,7 @@ systemd:
Description=Kubelet (System Container)
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -63,11 +63,9 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=${domain_name} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
%{~ for label in compact(split(",", node_labels)) ~}
@ -108,6 +106,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,11 +1,19 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
matchbox = "~> 0.3.0"
ct = "~> 0.4"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
matchbox = {
source = "poseidon/matchbox"
version = "~> 0.4.1"
}
}
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [Cilium](https://github.com/cilium/cilium) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, and other [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]

View File

@ -2,12 +2,12 @@
systemd:
units:
- name: etcd-member.service
enable: true
enabled: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.4.9"
Environment="ETCD_IMAGE_TAG=v3.4.12"
Environment="ETCD_IMAGE_URL=docker://quay.io/coreos/etcd"
Environment="RKT_RUN_ARGS=--insecure-options=image"
Environment="ETCD_NAME=${etcd_name}"
@ -28,11 +28,11 @@ systemd:
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: kubelet.path
enable: true
enabled: true
contents: |
[Unit]
Description=Watch for kubeconfig
@ -41,7 +41,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -62,7 +62,7 @@ systemd:
After=coreos-metadata.service
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
EnvironmentFile=/run/metadata/coreos
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -111,17 +111,14 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=$${COREOS_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -147,7 +144,7 @@ systemd:
--volume script,kind=host,source=/opt/bootstrap/apply \
--mount volume=script,target=/apply \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/apply
@ -156,8 +153,13 @@ systemd:
WantedBy=multi-user.target
storage:
directories:
- path: /var/lib/etcd
filesystem: root
mode: 0700
overwrite: true
- path: /etc/kubernetes
filesystem: root
mode: 0755
files:
- path: /opt/bootstrap/layout
filesystem: root
@ -173,6 +175,7 @@ storage:
mv tls/etcd/etcd-client* /etc/kubernetes/bootstrap-secrets/
chown -R etcd:etcd /etc/ssl/etcd
chmod -R 500 /etc/ssl/etcd
chmod -R 700 /var/lib/etcd
mv auth/kubeconfig /etc/kubernetes/bootstrap-secrets/
mv tls/k8s/* /etc/kubernetes/bootstrap-secrets/
mkdir -p /etc/kubernetes/manifests
@ -198,6 +201,7 @@ storage:
done
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184

View File

@ -2,11 +2,11 @@
systemd:
units:
- name: docker.service
enable: true
enabled: true
- name: locksmithd.service
mask: true
- name: kubelet.path
enable: true
enabled: true
contents: |
[Unit]
Description=Watch for kubeconfig
@ -15,7 +15,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: wait-for-dns.service
enable: true
enabled: true
contents: |
[Unit]
Description=Wait for DNS entries
@ -35,7 +35,7 @@ systemd:
After=coreos-metadata.service
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=docker://quay.io/poseidon/kubelet:v1.19.2
EnvironmentFile=/run/metadata/coreos
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -84,11 +84,9 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=$${COREOS_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
@ -101,7 +99,7 @@ systemd:
[Install]
WantedBy=multi-user.target
- name: delete-node.service
enable: true
enabled: true
contents: |
[Unit]
Description=Waiting to delete Kubernetes node on shutdown
@ -116,9 +114,11 @@ storage:
directories:
- path: /etc/kubernetes
filesystem: root
mode: 0755
files:
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
mode: 0644
contents:
inline: |
fs.inotify.max_user_watches=16184
@ -134,7 +134,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://quay.io/poseidon/kubelet:v1.18.3 \
docker://quay.io/poseidon/kubelet:v1.19.2 \
--net=host \
--dns=host \
--exec=/usr/local/bin/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -46,9 +46,10 @@ resource "digitalocean_droplet" "controllers" {
size = var.controller_type
# network
# only official DigitalOcean images support IPv6
ipv6 = local.is_official_image
private_networking = true
vpc_uuid = digitalocean_vpc.network.id
# TODO: Only official DigitalOcean images support IPv6
ipv6 = false
user_data = data.ct_config.controller-ignitions.*.rendered[count.index]
ssh_keys = var.ssh_fingerprints
@ -69,10 +70,10 @@ resource "digitalocean_tag" "controllers" {
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
pretty_print = false
snippets = var.controller_snippets
count = var.controller_count
content = data.template_file.controller-configs.*.rendered[count.index]
strict = true
snippets = var.controller_snippets
}
# Controller Container Linux configs

View File

@ -1,3 +1,10 @@
# Network VPC
resource "digitalocean_vpc" "network" {
name = var.cluster_name
region = var.region
description = "Network for ${var.cluster_name} cluster"
}
resource "digitalocean_firewall" "rules" {
name = var.cluster_name
@ -6,6 +13,11 @@ resource "digitalocean_firewall" "rules" {
digitalocean_tag.workers.name
]
inbound_rule {
protocol = "icmp"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# allow ssh, internal flannel, internal node-exporter, internal kubelet
inbound_rule {
protocol = "tcp"
@ -13,12 +25,27 @@ resource "digitalocean_firewall" "rules" {
source_addresses = ["0.0.0.0/0", "::/0"]
}
# Cilium health
inbound_rule {
protocol = "tcp"
port_range = "4240"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# IANA vxlan (flannel, calico)
inbound_rule {
protocol = "udp"
port_range = "4789"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# Linux vxlan (Cilium)
inbound_rule {
protocol = "udp"
port_range = "8472"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# Allow Prometheus to scrape node-exporter
inbound_rule {
protocol = "tcp"
@ -33,6 +60,7 @@ resource "digitalocean_firewall" "rules" {
source_tags = [digitalocean_tag.workers.name]
}
# Kubelet
inbound_rule {
protocol = "tcp"
port_range = "10250"

View File

@ -2,6 +2,8 @@ output "kubeconfig-admin" {
value = module.bootstrap.kubeconfig-admin
}
# Outputs for Kubernetes Ingress
output "controllers_dns" {
value = digitalocean_record.controllers[0].fqdn
}
@ -45,3 +47,10 @@ output "worker_tag" {
value = digitalocean_tag.workers.name
}
# Outputs for custom load balancing
output "vpc_id" {
description = "ID of the cluster VPC"
value = digitalocean_vpc.network.id
}

View File

@ -1,12 +1,20 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
digitalocean = "~> 1.3"
ct = "~> 0.3"
template = "~> 2.1"
null = "~> 2.1"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
digitalocean = {
source = "digitalocean/digitalocean"
version = "~> 1.20"
}
}
}

View File

@ -35,9 +35,10 @@ resource "digitalocean_droplet" "workers" {
size = var.worker_type
# network
# only official DigitalOcean images support IPv6
ipv6 = local.is_official_image
private_networking = true
vpc_uuid = digitalocean_vpc.network.id
# only official DigitalOcean images support IPv6
ipv6 = local.is_official_image
user_data = data.ct_config.worker-ignition.rendered
ssh_keys = var.ssh_fingerprints
@ -58,9 +59,9 @@ resource "digitalocean_tag" "workers" {
# Worker Ignition config
data "ct_config" "worker-ignition" {
content = data.template_file.worker-config.rendered
pretty_print = false
snippets = var.worker_snippets
content = data.template_file.worker-config.rendered
strict = true
snippets = var.worker_snippets
}
# Worker Container Linux config

View File

@ -11,7 +11,7 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.18.3 (upstream)
* Kubernetes v1.19.2 (upstream)
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/), SELinux enforcing
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/) customization

View File

@ -1,6 +1,6 @@
# Kubernetes assets (kubeconfig, manifests)
module "bootstrap" {
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=ff7ec52d0a5e97b8ca6b86a80a7e5e1ea8570487"
source = "git::https://github.com/poseidon/terraform-render-bootstrap.git?ref=d0f2123c5971410dc14aecde2307eb13e89c2bdf"
cluster_name = var.cluster_name
api_servers = [format("%s.%s", var.cluster_name, var.dns_zone)]

View File

@ -41,9 +41,10 @@ resource "digitalocean_droplet" "controllers" {
size = var.controller_type
# network
# TODO: Only official DigitalOcean images support IPv6
ipv6 = false
private_networking = true
vpc_uuid = digitalocean_vpc.network.id
# TODO: Only official DigitalOcean images support IPv6
ipv6 = false
user_data = data.ct_config.controller-ignitions.*.rendered[count.index]
ssh_keys = var.ssh_fingerprints

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: etcd-member.service
@ -28,7 +28,7 @@ systemd:
--network host \
--volume /var/lib/etcd:/var/lib/etcd:rw,Z \
--volume /etc/ssl/etcd:/etc/ssl/certs:ro,Z \
quay.io/coreos/etcd:v3.4.9
quay.io/coreos/etcd:v3.4.12
ExecStop=/usr/bin/podman stop etcd
[Install]
WantedBy=multi-user.target
@ -55,7 +55,7 @@ systemd:
After=afterburn.service
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
EnvironmentFile=/run/metadata/afterburn
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -94,17 +94,14 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=$${AFTERBURN_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/master \
--node-labels=node.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--register-with-taints=node-role.kubernetes.io/controller=:NoSchedule \
--rotate-certificates \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/podman stop kubelet
@ -138,11 +135,13 @@ systemd:
--volume /opt/bootstrap/assets:/assets:ro,Z \
--volume /opt/bootstrap/apply:/apply:ro,Z \
--entrypoint=/apply \
quay.io/poseidon/kubelet:v1.18.3
quay.io/poseidon/kubelet:v1.19.2
ExecStartPost=/bin/touch /opt/bootstrap/bootstrap.done
ExecStartPost=-/usr/bin/podman stop bootstrap
storage:
directories:
- path: /var/lib/etcd
mode: 0700
- path: /etc/kubernetes
- path: /opt/bootstrap
files:
@ -167,6 +166,7 @@ storage:
mv manifests /opt/bootstrap/assets/manifests
mv manifests-networking/* /opt/bootstrap/assets/manifests/
rm -rf assets auth static-manifests tls manifests-networking
chcon -R -u system_u -t container_file_t /etc/kubernetes/bootstrap-secrets
- path: /opt/bootstrap/apply
mode: 0544
contents:
@ -185,6 +185,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,6 +1,6 @@
---
variant: fcos
version: 1.0.0
version: 1.1.0
systemd:
units:
- name: docker.service
@ -26,7 +26,7 @@ systemd:
After=afterburn.service
Wants=rpc-statd.service
[Service]
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.18.3
Environment=KUBELET_IMAGE=quay.io/poseidon/kubelet:v1.19.2
EnvironmentFile=/run/metadata/afterburn
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -65,11 +65,9 @@ systemd:
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--healthz-port=0 \
--hostname-override=$${AFTERBURN_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
@ -100,7 +98,7 @@ systemd:
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/true
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.18.3 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
ExecStop=/bin/bash -c '/usr/bin/podman run --volume /etc/kubernetes:/etc/kubernetes:ro,z --entrypoint /usr/local/bin/kubectl quay.io/poseidon/kubelet:v1.19.2 --kubeconfig=/etc/kubernetes/kubeconfig delete node $HOSTNAME'
[Install]
WantedBy=multi-user.target
storage:
@ -111,6 +109,18 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
- path: /etc/sysctl.d/reverse-path-filter.conf
contents:
inline: |
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.*.rp_filter=0
- path: /etc/systemd/network/50-flannel.link
contents:
inline: |
[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none
- path: /etc/systemd/system.conf.d/accounting.conf
contents:
inline: |

View File

@ -1,3 +1,10 @@
# Network VPC
resource "digitalocean_vpc" "network" {
name = var.cluster_name
region = var.region
description = "Network for ${var.cluster_name} cluster"
}
resource "digitalocean_firewall" "rules" {
name = var.cluster_name
@ -6,6 +13,11 @@ resource "digitalocean_firewall" "rules" {
digitalocean_tag.workers.name
]
inbound_rule {
protocol = "icmp"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# allow ssh, internal flannel, internal node-exporter, internal kubelet
inbound_rule {
protocol = "tcp"
@ -13,12 +25,27 @@ resource "digitalocean_firewall" "rules" {
source_addresses = ["0.0.0.0/0", "::/0"]
}
# Cilium health
inbound_rule {
protocol = "tcp"
port_range = "4240"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# IANA vxlan (flannel, calico)
inbound_rule {
protocol = "udp"
port_range = "4789"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# Linux vxlan (Cilium)
inbound_rule {
protocol = "udp"
port_range = "8472"
source_tags = [digitalocean_tag.controllers.name, digitalocean_tag.workers.name]
}
# Allow Prometheus to scrape node-exporter
inbound_rule {
protocol = "tcp"
@ -33,6 +60,7 @@ resource "digitalocean_firewall" "rules" {
source_tags = [digitalocean_tag.workers.name]
}
# Kubelet
inbound_rule {
protocol = "tcp"
port_range = "10250"

View File

@ -2,6 +2,8 @@ output "kubeconfig-admin" {
value = module.bootstrap.kubeconfig-admin
}
# Outputs for Kubernetes Ingress
output "controllers_dns" {
value = digitalocean_record.controllers[0].fqdn
}
@ -45,3 +47,9 @@ output "worker_tag" {
value = digitalocean_tag.workers.name
}
# Outputs for custom load balancing
output "vpc_id" {
description = "ID of the cluster VPC"
value = digitalocean_vpc.network.id
}

View File

@ -1,12 +1,20 @@
# Terraform version and plugin versions
terraform {
required_version = "~> 0.12.6"
required_version = ">= 0.12.26, < 0.14.0"
required_providers {
digitalocean = "~> 1.3"
ct = "~> 0.3"
template = "~> 2.1"
null = "~> 2.1"
template = "~> 2.1"
null = "~> 2.1"
ct = {
source = "poseidon/ct"
version = "~> 0.6.1"
}
digitalocean = {
source = "digitalocean/digitalocean"
version = "~> 1.20"
}
}
}

View File

@ -37,9 +37,10 @@ resource "digitalocean_droplet" "workers" {
size = var.worker_type
# network
# TODO: Only official DigitalOcean images support IPv6
ipv6 = false
private_networking = true
vpc_uuid = digitalocean_vpc.network.id
# TODO: Only official DigitalOcean images support IPv6
ipv6 = false
user_data = data.ct_config.worker-ignition.rendered
ssh_keys = var.ssh_fingerprints

39
docs/addons/fleetlock.md Normal file
View File

@ -0,0 +1,39 @@
## fleetlock
[fleetlock](https://github.com/poseidon/fleetlock) is a reboot coordinator for Fedora CoreOS nodes. It implements the [FleetLock](https://github.com/coreos/airlock/pull/1/files) protocol for use as a [Zincati](https://github.com/coreos/zincati) lock [strategy](https://github.com/coreos/zincati/blob/master/docs/usage/updates-strategy.md) backend.
Declare a Zincati `fleet_lock` strategy when provisioning Fedora CoreOS nodes via [snippets](/advanced/customization/#hosts).
```yaml
variant: fcos
version: 1.1.0
storage:
files:
- path: /etc/zincati/config.d/55-update-strategy.toml
contents:
inline: |
[updates]
strategy = "fleet_lock"
[updates.fleet_lock]
base_url = "http://10.3.0.15/"
```
```tf
module "nemo" {
...
controller_snippets = [
file("./snippets/zincati-strategy.yaml"),
]
worker_snippets = [
file("./snippets/zincati-strategy.yaml"),
]
}
```
Apply fleetlock based on the example manifests.
```sh
git clone git@github.com:poseidon/fleetlock.git
kubectl apply -f examples/k8s
```

View File

@ -1,8 +1,9 @@
# Addons
Every Typhoon cluster is verified to work well with several post-install addons.
Typhoon clusters are verified to work well with several post-install addons.
* Nginx [Ingress Controller](ingress.md)
* [Prometheus](prometheus.md)
* [Grafana](grafana.md)
* [fleetlock](fleetlock.md)

View File

@ -37,7 +37,7 @@ For example, ensure an `/opt/hello` file is created with permissions 0644.
```yaml
# custom-files
variant: fcos
version: 1.0.0
version: 1.1.0
storage:
files:
- path: /opt/hello
@ -83,7 +83,7 @@ module "mercury" {
}
```
### Container Linux
### Flatcar Linux
Define a Container Linux Config (CLC) ([config](https://github.com/coreos/container-linux-config-transpiler/blob/master/doc/configuration.md), [examples](https://github.com/coreos/container-linux-config-transpiler/blob/master/doc/examples.md)) in version control near your Terraform workspace directory (e.g. perhaps in a `snippets` subdirectory). You may organize snippets into multiple files, if desired.
@ -125,7 +125,7 @@ systemd:
Environment="ETCD_LOG_PACKAGE_LEVELS=etcdserver=WARNING,security=DEBUG"
```
Reference the CLC contents by location (e.g. `file("./custom-units.yaml")`). On [AWS](/cl/aws/#cluster), [Azure](/cl/azure/#cluster), [DigitalOcean](/cl/digital-ocean/#cluster), or [Google Cloud](/cl/google-cloud/#cluster) extend the `controller_snippets` or `worker_snippets` list variables.
Reference the CLC contents by location (e.g. `file("./custom-units.yaml")`). On [AWS](/flatcar-linux/aws/#cluster), [Azure](/flatcar-linux/azure/#cluster), [DigitalOcean](/flatcar-linux/digital-ocean/#cluster), or [Google Cloud](/flatcar-linux/google-cloud/#cluster) extend the `controller_snippets` or `worker_snippets` list variables.
```tf
module "nemo" {
@ -145,7 +145,7 @@ module "nemo" {
}
```
On [Bare-Metal](/cl/bare-metal/#cluster), different CLCs may be used for each node (since hardware may be heterogeneous). Extend the `snippets` map variable by mapping a controller or worker name key to a list of snippets.
On [Bare-Metal](/flatcar-linux/bare-metal/#cluster), different CLCs may be used for each node (since hardware may be heterogeneous). Extend the `snippets` map variable by mapping a controller or worker name key to a list of snippets.
```tf
module "mercury" {
@ -174,3 +174,34 @@ module "nemo" {
To customize low-level Kubernetes control plane bootstrapping, see the [poseidon/terraform-render-bootstrap](https://github.com/poseidon/terraform-render-bootstrap) Terraform module.
## Kubelet
Typhoon publishes Kubelet [container images](/topics/security/#container-images) to Quay.io (default) and to Dockerhub (in case of a Quay [outage](https://github.com/poseidon/typhoon/issues/735) or breach). Quay automated builds also provide the option for fully verifiable tagged images (`build-{short_sha}`).
To set an alternative Kubelet image, use a snippet to set a systemd dropin.
```
# host-image-override.yaml
variant: fcos <- remove for Flatcar Linux
version: 1.1.0 <- remove for Flatcar Linux
systemd:
units:
- name: kubelet.service
dropins:
- name: 10-image-override.conf
contents: |
[Service]
Environment=KUBELET_IMAGE=docker.io/psdn/kubelet:v1.18.3
```
```
module "nemo" {
...
worker_snippets = [
file("./snippets/host-image-override.yaml")
]
...
}
```

View File

@ -15,26 +15,51 @@ Internal Terraform Modules:
Create a cluster following the AWS [tutorial](../flatcar-linux/aws.md#cluster). Define a worker pool using the AWS internal `workers` module.
```tf
module "tempest-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//aws/container-linux/kubernetes/workers?ref=v1.14.3"
=== "Fedora CoreOS"
# AWS
vpc_id = module.tempest.vpc_id
subnet_ids = module.tempest.subnet_ids
security_groups = module.tempest.worker_security_groups
```tf
module "tempest-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//aws/fedora-coreos/kubernetes/workers?ref=v1.19.2"
# configuration
name = "tempest-pool"
kubeconfig = module.tempest.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# AWS
vpc_id = module.tempest.vpc_id
subnet_ids = module.tempest.subnet_ids
security_groups = module.tempest.worker_security_groups
# optional
worker_count = 2
instance_type = "m5.large"
os_image = "flatcar-beta"
}
```
# configuration
name = "tempest-pool"
kubeconfig = module.tempest.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
instance_type = "m5.large"
os_stream = "next"
}
```
=== "Flatcar Linux"
```tf
module "tempest-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//aws/container-linux/kubernetes/workers?ref=v1.19.2"
# AWS
vpc_id = module.tempest.vpc_id
subnet_ids = module.tempest.subnet_ids
security_groups = module.tempest.worker_security_groups
# configuration
name = "tempest-pool"
kubeconfig = module.tempest.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
instance_type = "m5.large"
os_image = "flatcar-beta"
}
```
Apply the change.
@ -65,12 +90,13 @@ The AWS internal `workers` module supports a number of [variables](https://githu
|:-----|:------------|:--------|:--------|
| worker_count | Number of instances | 1 | 3 |
| instance_type | EC2 instance type | "t3.small" | "t3.medium" |
| os_image | AMI channel for a Container Linux derivative | "flatcar-stable" | flatcar-stable, flatcar-beta, flatcar-alph, coreos-stable, coreos-beta, coreos-alpha |
| os_image | AMI channel for a Container Linux derivative | "flatcar-stable" | flatcar-stable, flatcar-beta, flatcar-alpha, flatcar-edge |
| os_stream | Fedora CoreOS stream for compute instances | "stable" | "testing", "next" |
| disk_size | Size of the EBS volume in GB | 40 | 100 |
| disk_type | Type of the EBS volume | "gp2" | standard, gp2, io1 |
| disk_iops | IOPS of the EBS volume | 0 (i.e. auto) | 400 |
| spot_price | Spot price in USD for worker instances or 0 to use on-demand instances | 0 | 0.10 |
| snippets | Container Linux Config snippets | [] | [examples](/advanced/customization/) |
| snippets | Fedora CoreOS or Container Linux Config snippets | [] | [examples](/advanced/customization/) |
| service_cidr | Must match `service_cidr` of cluster | "10.3.0.0/16" | "10.3.0.0/24" |
| node_labels | List of initial node labels | [] | ["worker-pool=foo"] |
@ -80,28 +106,57 @@ Check the list of valid [instance types](https://aws.amazon.com/ec2/instance-typ
Create a cluster following the Azure [tutorial](../flatcar-linux/azure.md#cluster). Define a worker pool using the Azure internal `workers` module.
```tf
module "ramius-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//azure/container-linux/kubernetes/workers?ref=v1.18.3"
=== "Fedora CoreOS"
# Azure
region = module.ramius.region
resource_group_name = module.ramius.resource_group_name
subnet_id = module.ramius.subnet_id
security_group_id = module.ramius.security_group_id
backend_address_pool_id = module.ramius.backend_address_pool_id
```tf
module "ramius-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//azure/fedora-coreos/kubernetes/workers?ref=v1.19.2"
# configuration
name = "ramius-spot"
kubeconfig = module.ramius.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# Azure
region = module.ramius.region
resource_group_name = module.ramius.resource_group_name
subnet_id = module.ramius.subnet_id
security_group_id = module.ramius.security_group_id
backend_address_pool_id = module.ramius.backend_address_pool_id
# optional
worker_count = 2
vm_type = "Standard_F4"
priority = "Spot"
}
```
# configuration
name = "ramius-spot"
kubeconfig = module.ramius.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
vm_type = "Standard_F4"
priority = "Spot"
os_image = "/subscriptions/some/path/Microsoft.Compute/images/fedora-coreos-31.20200323.3.2"
}
```
=== "Flatcar Linux"
```tf
module "ramius-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//azure/container-linux/kubernetes/workers?ref=v1.19.2"
# Azure
region = module.ramius.region
resource_group_name = module.ramius.resource_group_name
subnet_id = module.ramius.subnet_id
security_group_id = module.ramius.security_group_id
backend_address_pool_id = module.ramius.backend_address_pool_id
# configuration
name = "ramius-spot"
kubeconfig = module.ramius.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
vm_type = "Standard_F4"
priority = "Spot"
os_image = "flatcar-beta"
}
```
Apply the change.
@ -134,7 +189,7 @@ The Azure internal `workers` module supports a number of [variables](https://git
|:-----|:------------|:--------|:--------|
| worker_count | Number of instances | 1 | 3 |
| vm_type | Machine type for instances | "Standard_DS1_v2" | See below |
| os_image | Channel for a Container Linux derivative | "flatcar-stable" | flatcar-stable, flatcar-beta, flatcar-alpha, flatcar-edge, coreos-stable, coreos-beta, coreos-alpha |
| os_image | Channel for a Container Linux derivative | "flatcar-stable" | flatcar-stable, flatcar-beta, flatcar-alpha, flatcar-edge |
| priority | Set priority to Spot to use reduced cost surplus capacity, with the tradeoff that instances can be deallocated at any time | "Regular" | "Spot" |
| snippets | Container Linux Config snippets | [] | [examples](/advanced/customization/) |
| service_cidr | CIDR IPv4 range to assign to Kubernetes services | "10.3.0.0/16" | "10.3.0.0/24" |
@ -146,27 +201,53 @@ Check the list of valid [machine types](https://azure.microsoft.com/en-us/pricin
Create a cluster following the Google Cloud [tutorial](../flatcar-linux/google-cloud.md#cluster). Define a worker pool using the Google Cloud internal `workers` module.
```tf
module "yavin-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes/workers?ref=v1.18.3"
=== "Fedora CoreOS"
# Google Cloud
region = "europe-west2"
network = module.yavin.network_name
cluster_name = "yavin"
```tf
module "yavin-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/fedora-coreos/kubernetes/workers?ref=v1.19.2"
# configuration
name = "yavin-16x"
kubeconfig = module.yavin.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# Google Cloud
region = "europe-west2"
network = module.yavin.network_name
cluster_name = "yavin"
# optional
worker_count = 2
machine_type = "n1-standard-16"
os_image = "coreos-beta"
preemptible = true
}
```
# configuration
name = "yavin-16x"
kubeconfig = module.yavin.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
machine_type = "n1-standard-16"
os_stream = "testing"
preemptible = true
}
```
=== "Flatcar Linux"
```tf
module "yavin-worker-pool" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes/workers?ref=v1.19.2"
# Google Cloud
region = "europe-west2"
network = module.yavin.network_name
cluster_name = "yavin"
# configuration
name = "yavin-16x"
kubeconfig = module.yavin.kubeconfig
ssh_authorized_key = var.ssh_authorized_key
# optional
worker_count = 2
machine_type = "n1-standard-16"
os_image = "flatcar-linux-2303-4-0" # custom
preemptible = true
}
```
Apply the change.
@ -179,11 +260,11 @@ Verify a managed instance group of workers joins the cluster within a few minute
```
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-0.c.example-com.internal Ready 6m v1.18.3
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.18.3
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.18.3
yavin-16x-worker-jrbf.c.example-com.internal Ready 3m v1.18.3
yavin-16x-worker-mzdm.c.example-com.internal Ready 3m v1.18.3
yavin-controller-0.c.example-com.internal Ready 6m v1.19.2
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.19.2
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.19.2
yavin-16x-worker-jrbf.c.example-com.internal Ready 3m v1.19.2
yavin-16x-worker-mzdm.c.example-com.internal Ready 3m v1.19.2
```
### Variables
@ -199,7 +280,7 @@ The Google Cloud internal `workers` module supports a number of [variables](http
| region | Region for the worker pool instances. May differ from the cluster's region | "europe-west2" |
| network | Must be set to `network_name` output by cluster | module.cluster.network_name |
| kubeconfig | Must be set to `kubeconfig` output by cluster | module.cluster.kubeconfig |
| os_image | Container Linux image for compute instances | "fedora-coreos-or-flatcar-image", coreos-stable, coreos-beta, coreos-alpha |
| os_image | Container Linux image for compute instances | "uploaded-flatcar-image" |
| ssh_authorized_key | SSH public key for user 'core' | "ssh-rsa AAAAB3NZ..." |
Check the list of regions [docs](https://cloud.google.com/compute/docs/regions-zones/regions-zones) or with `gcloud compute regions list`.

View File

@ -30,6 +30,7 @@ Add a DigitalOcean load balancer to distribute IPv4 TCP traffic (HTTP/HTTPS Ingr
resource "digitalocean_loadbalancer" "ingress" {
name = "ingress"
region = "fra1"
vpc_uuid = module.nemo.vpc_id
droplet_tag = module.nemo.worker_tag
healthcheck {

Some files were not shown because too many files have changed in this diff Show More