Improve Prometheus rules and alerts and Grafana dashboards

* Collate upstream rules, alerts, and dashboards and tune for use
in Typhoon
* Previously, a well-chosen (but older) set of rules, alerts, and
dashboards was maintained to reflect metric name changes
Dalton Hubble 2019-02-18 12:05:27 -08:00
parent 6fa3b8a13f
commit e483c81ce9
11 changed files with 8906 additions and 6670 deletions

@@ -13,15 +13,16 @@ Notable changes between versions.
#### Addons
* Improve Prometheus rules and alerts ([#404](https://github.com/poseidon/typhoon/pull/404))
* Drop extraneous metrics ([#397](https://github.com/poseidon/typhoon/pull/397))
* Add `pod` name label to metrics discovered via service endpoints
* Rename `kubernetes_namespace` label to `namespace`
* Improve Grafana and dashboards ([#403](https://github.com/poseidon/typhoon/pull/403), [#404](https://github.com/poseidon/typhoon/pull/404))
* Upgrade Grafana from v5.4.3 to 6.0.0-beta2
* Enable Grafana [Explore](http://docs.grafana.org/guides/whats-new-in-v6-0/#explore) UI as a Viewer (inspect/edit without saving)
* Raise nginx-ingress liveness/readiness timeout to 5 seconds
* Remove nginx-ingess default-backend ([#401](https://github.com/poseidon/typhoon/pull/401))
* nginx-ingress now responds with its own 404 page by default
* Improve Prometheus metrics labels and drop extraneous metrics ([#397](https://github.com/poseidon/typhoon/pull/397))
* Add `pod` name label to metrics discovered via service endpoints
* Rename `kubernetes_namespace` label to `namespace`
* Revamp the Grafana addon ([#403](https://github.com/poseidon/typhoon/pull/403))
* Upgrade Grafana from v5.4.3 to 6.0.0-beta2
* Enable Grafana [Explore](http://docs.grafana.org/guides/whats-new-in-v6-0/#explore) UI as a Viewer (inspect/edit without saving)
## v1.13.3

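The changelog entry above renames the `kubernetes_namespace` label to `namespace`, so any custom queries or dashboards that still select on the old label would need updating. A minimal sketch of such a rewrite, assuming the queries are available as plain PromQL strings; the helper name `rename_namespace_label` is hypothetical and not part of Typhoon:

```python
import re

# Match the old label name only as a whole identifier, so that longer names
# which merely contain it (e.g. "my_kubernetes_namespace") are left alone.
_OLD_LABEL = re.compile(r"\bkubernetes_namespace\b")


def rename_namespace_label(query: str) -> str:
    """Return the PromQL query with `kubernetes_namespace` renamed to `namespace`."""
    return _OLD_LABEL.sub("namespace", query)


if __name__ == "__main__":
    old_query = 'sum(up{kubernetes_namespace="monitoring"}) by (kubernetes_namespace)'
    print(rename_namespace_label(old_query))
```

This is a plain textual substitution, so it should be reviewed by hand for queries where the old name appears inside a string value rather than as a label.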

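The new `etcdInsufficientMembers` rule in the diff that follows compares the number of etcd members reporting `up == 1` against `(count + 1) / 2` per job. As a rough illustration of that arithmetic only (not of how Prometheus evaluates the rule), the comparison can be sketched as follows; the function name and arguments are hypothetical:

```python
def insufficient_members(up_members: int, total_members: int) -> bool:
    """Mirror the rule's comparison: sum(up == bool 1) < (count(up) + 1) / 2.

    The result is True when fewer than a majority of members are up,
    i.e. when the cluster no longer has quorum.
    """
    return up_members < (total_members + 1) / 2


if __name__ == "__main__":
    for total in (3, 4, 5):
        for up in range(total, -1, -1):
            print(f"total={total} up={up} fires={insufficient_members(up, total)}")
```

For a three-member cluster the condition holds once only one member is up, and for a five-member cluster once only two are up. In the rule itself the condition must also persist for the `for: 3m` duration before the alert fires.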
@@ -4,582 +4,1089 @@ metadata:
name: prometheus-rules
namespace: monitoring
data:
alertmanager.rules.yaml: | etcd.yaml: |-
groups: {
- name: alertmanager.rules "groups": [
rules: {
- alert: AlertmanagerConfigInconsistent "name": "etcd",
expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service) "rules": [
GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service", {
"alertmanager-$1", "alertmanager", "(.*)") != 1 "alert": "etcdInsufficientMembers",
for: 5m "annotations": {
labels: "message": "etcd cluster \"{{ $labels.job }}\": insufficient members ({{ $value }})."
severity: critical },
annotations: "expr": "sum(up{job=~\".*etcd.*\"} == bool 1) by (job) < ((count(up{job=~\".*etcd.*\"}) by (job) + 1) / 2)\n",
description: The configuration of the instances of the Alertmanager cluster "for": "3m",
`{{$labels.service}}` are out of sync. "labels": {
- alert: AlertmanagerDownOrMissing "severity": "critical"
expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1", }
"alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1 },
for: 5m {
labels: "alert": "etcdNoLeader",
severity: warning "annotations": {
annotations: "message": "etcd cluster \"{{ $labels.job }}\": member {{ $labels.instance }} has no leader."
description: An unexpected number of Alertmanagers are scraped or Alertmanagers },
disappeared from discovery. "expr": "etcd_server_has_leader{job=~\".*etcd.*\"} == 0\n",
- alert: AlertmanagerFailedReload "for": "1m",
expr: alertmanager_config_last_reload_successful == 0 "labels": {
for: 10m "severity": "critical"
labels: }
severity: warning },
annotations: {
description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace "alert": "etcdHighNumberOfLeaderChanges",
}}/{{ $labels.pod}}. "annotations": {
etcd3.rules.yaml: | "message": "etcd cluster \"{{ $labels.job }}\": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last 30 minutes."
groups: },
- name: ./etcd3.rules "expr": "rate(etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}[15m]) > 3\n",
rules: "for": "15m",
- alert: InsufficientMembers "labels": {
expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1) "severity": "warning"
for: 3m }
labels: },
severity: critical {
annotations: "alert": "etcdGRPCRequestsSlow",
description: If one more etcd member goes down the cluster will be unavailable "annotations": {
summary: etcd cluster insufficient members "message": "etcd cluster \"{{ $labels.job }}\": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}."
- alert: NoLeader },
expr: etcd_server_has_leader{job="etcd"} == 0 "expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\", grpc_type=\"unary\"}[5m])) by (job, instance, grpc_service, grpc_method, le))\n> 0.15\n",
for: 1m "for": "10m",
labels: "labels": {
severity: critical "severity": "critical"
annotations: }
description: etcd member {{ $labels.instance }} has no leader },
summary: etcd member has no leader {
- alert: HighNumberOfLeaderChanges "alert": "etcdMemberCommunicationSlow",
expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3 "annotations": {
labels: "message": "etcd cluster \"{{ $labels.job }}\": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}."
severity: warning },
annotations: "expr": "histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.15\n",
description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader "for": "10m",
changes within the last hour "labels": {
summary: a high number of leader changes within the etcd cluster are happening "severity": "warning"
- alert: GRPCRequestsSlow }
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) },
> 0.15 {
for: 10m "alert": "etcdHighNumberOfFailedProposals",
labels: "annotations": {
severity: critical "message": "etcd cluster \"{{ $labels.job }}\": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}."
annotations: },
description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method "expr": "rate(etcd_server_proposals_failed_total{job=~\".*etcd.*\"}[15m]) > 5\n",
}} are slow "for": "15m",
summary: slow gRPC requests "labels": {
- alert: HighNumberOfFailedHTTPRequests "severity": "warning"
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m])) }
BY (method) > 0.01 },
for: 10m {
labels: "alert": "etcdHighFsyncDurations",
severity: warning "annotations": {
annotations: "message": "etcd cluster \"{{ $labels.job }}\": 99th percentile fync durations are {{ $value }}s on etcd instance {{ $labels.instance }}."
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd },
instance {{ $labels.instance }}' "expr": "histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.5\n",
summary: a high number of HTTP requests are failing "for": "10m",
- alert: HighNumberOfFailedHTTPRequests "labels": {
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m])) "severity": "warning"
BY (method) > 0.05 }
for: 5m },
labels: {
severity: critical "alert": "etcdHighCommitDurations",
annotations: "annotations": {
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd "message": "etcd cluster \"{{ $labels.job }}\": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}."
instance {{ $labels.instance }}' },
summary: a high number of HTTP requests are failing "expr": "histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.25\n",
- alert: HTTPRequestsSlow "for": "10m",
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) "labels": {
> 0.15 "severity": "warning"
for: 10m }
labels: },
severity: warning {
annotations: "alert": "etcdHighNumberOfFailedHTTPRequests",
description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method "annotations": {
}} are slow "message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}"
summary: slow HTTP requests },
- alert: EtcdMemberCommunicationSlow "expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.01\n",
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) "for": "10m",
> 0.15 "labels": {
for: 10m "severity": "warning"
labels: }
severity: warning },
annotations: {
description: etcd instance {{ $labels.instance }} member communication with "alert": "etcdHighNumberOfFailedHTTPRequests",
{{ $labels.To }} is slow "annotations": {
summary: etcd member communication is slow "message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}."
- alert: HighNumberOfFailedProposals },
expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5 "expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.05\n",
labels: "for": "10m",
severity: warning "labels": {
annotations: "severity": "critical"
description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal }
failures within the last hour },
summary: a high number of proposals within the etcd cluster are failing {
- alert: HighFsyncDurations "alert": "etcdHTTPRequestsSlow",
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) "annotations": {
> 0.5 "message": "etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow."
for: 10m },
labels: "expr": "histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))\n> 0.15\n",
severity: warning "for": "10m",
annotations: "labels": {
description: etcd instance {{ $labels.instance }} fync durations are high "severity": "warning"
summary: high fsync durations }
- alert: HighCommitDurations }
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) ]
> 0.25 }
for: 10m ]
labels: }
severity: warning extra.yaml: |-
annotations: {
description: etcd instance {{ $labels.instance }} commit durations are high "groups": [
summary: high commit durations {
general.rules.yaml: | "name": "extra.rules",
groups: "rules": [
- name: general.rules {
rules: "alert": "InactiveRAIDDisk",
- alert: TargetDown "annotations": {
expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10 "message": "{{ $value }} RAID disk(s) on node {{ $labels.instance }} are inactive."
for: 10m },
labels: "expr": "node_md_disks - node_md_disks_active > 0",
severity: warning "for": "10m",
annotations: "labels": {
description: '{{ $value }}% of {{ $labels.job }} targets are down.' "severity": "warning"
summary: Targets are down }
- record: fd_utilization }
expr: process_open_fds / process_max_fds ]
- alert: FdExhaustionClose }
expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1 ]
for: 10m }
labels: kube.yaml: |-
severity: warning {
annotations: "groups": [
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance {
will exhaust in file/socket descriptors within the next 4 hours' "name": "k8s.rules",
summary: file descriptors soon exhausted "rules": [
- alert: FdExhaustionClose {
expr: predict_linear(fd_utilization[10m], 3600) > 1 "expr": "sum(rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])) by (namespace)\n",
for: 10m "record": "namespace:container_cpu_usage_seconds_total:sum_rate"
labels: },
severity: critical {
annotations: "expr": "sum by (namespace, pod_name, container_name) (\n rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])\n)\n",
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance "record": "namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate"
will exhaust in file/socket descriptors within the next hour' },
summary: file descriptors soon exhausted {
kube-controller-manager.rules.yaml: | "expr": "sum(container_memory_usage_bytes{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}) by (namespace)\n",
groups: "record": "namespace:container_memory_usage_bytes:sum"
- name: kube-controller-manager.rules },
rules: {
- alert: K8SControllerManagerDown "expr": "sum by (namespace, label_name) (\n sum(rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])) by (namespace, pod_name)\n * on (namespace, pod_name) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
expr: absent(up{job="kube-controller-manager"} == 1) "record": "namespace_name:container_cpu_usage_seconds_total:sum_rate"
for: 5m },
labels: {
severity: critical "expr": "sum by (namespace, label_name) (\n sum(container_memory_usage_bytes{job=\"kubernetes-cadvisor\",image!=\"\", container_name!=\"\"}) by (pod_name, namespace)\n* on (namespace, pod_name) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
annotations: "record": "namespace_name:container_memory_usage_bytes:sum"
description: There is no running K8S controller manager. Deployments and replication },
controllers are not making progress. {
summary: Controller manager is down "expr": "sum by (namespace, label_name) (\n sum(kube_pod_container_resource_requests_memory_bytes{job=\"kube-state-metrics\"}) by (namespace, pod)\n* on (namespace, pod) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
kube-scheduler.rules.yaml: | "record": "namespace_name:kube_pod_container_resource_requests_memory_bytes:sum"
groups: },
- name: kube-scheduler.rules {
rules: "expr": "sum by (namespace, label_name) (\n sum(kube_pod_container_resource_requests_cpu_cores{job=\"kube-state-metrics\"} and on(pod) kube_pod_status_scheduled{condition=\"true\"}) by (namespace, pod)\n* on (namespace, pod) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile "record": "namespace_name:kube_pod_container_resource_requests_cpu_cores:sum"
expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) }
BY (le, cluster)) / 1e+06 ]
labels: },
quantile: "0.99" {
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile "name": "kube-scheduler.rules",
expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) "rules": [
BY (le, cluster)) / 1e+06 {
labels: "expr": "histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
quantile: "0.9" "labels": {
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile "quantile": "0.99"
expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) },
BY (le, cluster)) / 1e+06 "record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
labels: },
quantile: "0.5" {
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile "expr": "histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket) "labels": {
BY (le, cluster)) / 1e+06 "quantile": "0.99"
labels: },
quantile: "0.99" "record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile },
expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket) {
BY (le, cluster)) / 1e+06 "expr": "histogram_quantile(0.99, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
labels: "labels": {
quantile: "0.9" "quantile": "0.99"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile },
expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket) "record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
BY (le, cluster)) / 1e+06 },
labels: {
quantile: "0.5" "expr": "histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
- record: cluster:scheduler_binding_latency_seconds:quantile "labels": {
expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket) "quantile": "0.9"
BY (le, cluster)) / 1e+06 },
labels: "record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
quantile: "0.99" },
- record: cluster:scheduler_binding_latency_seconds:quantile {
expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket) "expr": "histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
BY (le, cluster)) / 1e+06 "labels": {
labels: "quantile": "0.9"
quantile: "0.9" },
- record: cluster:scheduler_binding_latency_seconds:quantile "record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket) },
BY (le, cluster)) / 1e+06 {
labels: "expr": "histogram_quantile(0.9, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
quantile: "0.5" "labels": {
- alert: K8SSchedulerDown "quantile": "0.9"
expr: absent(up{job="kube-scheduler"} == 1) },
for: 5m "record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
labels: },
severity: critical {
annotations: "expr": "histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
description: There is no running K8S scheduler. New pods are not being assigned "labels": {
to nodes. "quantile": "0.5"
summary: Scheduler is down },
kube-state-metrics.rules.yaml: | "record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
groups: },
- name: kube-state-metrics.rules {
rules: "expr": "histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
- alert: DeploymentGenerationMismatch "labels": {
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation "quantile": "0.5"
for: 15m },
labels: "record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
severity: warning },
annotations: {
description: Observed deployment generation does not match expected one for "expr": "histogram_quantile(0.5, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
deployment {{$labels.namespaces}}/{{$labels.deployment}} "labels": {
summary: Deployment is outdated "quantile": "0.5"
- alert: DeploymentReplicasNotUpdated },
expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) "record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas)) }
unless (kube_deployment_spec_paused == 1) ]
for: 15m },
labels: {
severity: warning "name": "kube-apiserver.rules",
annotations: "rules": [
description: Replicas are not updated and available for deployment {{$labels.namespaces}}/{{$labels.deployment}} {
summary: Deployment replicas are outdated "expr": "histogram_quantile(0.99, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
- alert: DaemonSetRolloutStuck "labels": {
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled "quantile": "0.99"
* 100 < 100 },
for: 15m "record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
labels: },
severity: warning {
annotations: "expr": "histogram_quantile(0.9, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
description: Only {{$value}}% of desired pods scheduled and ready for daemon "labels": {
set {{$labels.namespaces}}/{{$labels.daemonset}} "quantile": "0.9"
summary: DaemonSet is missing pods },
- alert: K8SDaemonSetsNotScheduled "record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled },
> 0 {
for: 10m "expr": "histogram_quantile(0.5, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
labels: "labels": {
severity: warning "quantile": "0.5"
annotations: },
description: A number of daemonsets are not scheduled. "record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
summary: Daemonsets are not scheduled correctly }
- alert: DaemonSetsMissScheduled ]
expr: kube_daemonset_status_number_misscheduled > 0 },
for: 10m {
labels: "name": "node.rules",
severity: warning "rules": [
annotations: {
description: A number of daemonsets are running where they are not supposed "expr": "sum(min(kube_pod_info) by (node))",
to run. "record": ":kube_pod_info_node_count:"
summary: Daemonsets are not scheduled correctly },
- alert: PodFrequentlyRestarting {
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5 "expr": "max(label_replace(kube_pod_info{job=\"kube-state-metrics\"}, \"pod\", \"$1\", \"pod\", \"(.*)\")) by (node, namespace, pod)\n",
for: 10m "record": "node_namespace_pod:kube_pod_info:"
labels: },
severity: warning {
annotations: "expr": "count by (node) (sum by (node, cpu) (\n node_cpu_seconds_total{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n))\n",
description: Pod {{$labels.namespaces}}/{{$labels.pod}} restarted {{$value}} "record": "node:node_num_cpu:sum"
times within the last hour },
summary: Pod is restarting frequently {
kubelet.rules.yaml: | "expr": "1 - avg(rate(node_cpu_seconds_total{job=\"node-exporter\",mode=\"idle\"}[1m]))\n",
groups: "record": ":node_cpu_utilisation:avg1m"
- name: kubelet.rules },
rules: {
- alert: K8SNodeNotReady "expr": "1 - avg by (node) (\n rate(node_cpu_seconds_total{job=\"node-exporter\",mode=\"idle\"}[1m])\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:)\n",
expr: kube_node_status_condition{condition="Ready",status="true"} == 0 "record": "node:node_cpu_utilisation:avg1m"
for: 1h },
labels: {
severity: warning "expr": "node:node_cpu_utilisation:avg1m\n *\nnode:node_num_cpu:sum\n /\nscalar(sum(node:node_num_cpu:sum))\n",
annotations: "record": "node:cluster_cpu_utilisation:ratio"
description: The Kubelet on {{ $labels.node }} has not checked in with the API, },
or has set itself to NotReady, for more than an hour {
summary: Node status is NotReady "expr": "sum(node_load1{job=\"node-exporter\"})\n/\nsum(node:node_num_cpu:sum)\n",
- alert: K8SManyNodesNotReady "record": ":node_cpu_saturation_load1:"
expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0) },
> 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == {
0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2 "expr": "sum by (node) (\n node_load1{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n/\nnode:node_num_cpu:sum\n",
for: 1m "record": "node:node_cpu_saturation_load1:"
labels: },
severity: critical {
annotations: "expr": "1 -\nsum(node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n/\nsum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n",
description: '{{ $value }}% of Kubernetes nodes are not ready' "record": ":node_memory_utilisation:"
- alert: K8SKubeletDown },
expr: count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3 {
for: 1h "expr": "sum(node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n",
labels: "record": ":node_memory_MemFreeCachedBuffers_bytes:sum"
severity: warning },
annotations: {
description: Prometheus failed to scrape {{ $value }}% of kubelets. "expr": "sum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n",
- alert: K8SKubeletDown "record": ":node_memory_MemTotal_bytes:sum"
expr: (absent(up{job="kubelet"} == 1) or count(up{job="kubelet"} == 0) / count(up{job="kubelet"})) },
* 100 > 10 {
for: 1h "expr": "sum by (node) (\n (node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
labels: "record": "node:node_memory_bytes_available:sum"
severity: critical },
annotations: {
description: Prometheus failed to scrape {{ $value }}% of kubelets, or all Kubelets "expr": "sum by (node) (\n node_memory_MemTotal_bytes{job=\"node-exporter\"}\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
have disappeared from service discovery. "record": "node:node_memory_bytes_total:sum"
summary: Many Kubelets cannot be scraped },
- alert: K8SKubeletTooManyPods {
expr: kubelet_running_pod_count > 100 "expr": "(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)\n/\nnode:node_memory_bytes_total:sum\n",
for: 10m "record": "node:node_memory_utilisation:ratio"
labels: },
severity: warning {
annotations: "expr": "(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)\n/\nscalar(sum(node:node_memory_bytes_total:sum))\n",
description: Kubelet {{$labels.instance}} is running {{$value}} pods, close "record": "node:cluster_memory_utilisation:ratio"
to the limit of 110 },
summary: Kubelet is close to pod limit {
kubernetes.rules.yaml: | "expr": "1e3 * sum(\n (rate(node_vmstat_pgpgin{job=\"node-exporter\"}[1m])\n + rate(node_vmstat_pgpgout{job=\"node-exporter\"}[1m]))\n)\n",
groups: "record": ":node_memory_swap_io_bytes:sum_rate"
- name: kubernetes.rules },
rules: {
- record: pod_name:container_memory_usage_bytes:sum "expr": "1 -\nsum by (node) (\n (node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n/\nsum by (node) (\n node_memory_MemTotal_bytes{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY "record": "node:node_memory_utilisation:"
(pod_name) },
- record: pod_name:container_spec_cpu_shares:sum {
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name) "expr": "1 - (node:node_memory_bytes_available:sum / node:node_memory_bytes_total:sum)\n",
- record: pod_name:container_cpu_usage:sum "record": "node:node_memory_utilisation_2:"
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) },
BY (pod_name) {
- record: pod_name:container_fs_usage_bytes:sum "expr": "1e3 * sum by (node) (\n (rate(node_vmstat_pgpgin{job=\"node-exporter\"}[1m])\n + rate(node_vmstat_pgpgout{job=\"node-exporter\"}[1m]))\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name) "record": "node:node_memory_swap_io_bytes:sum_rate"
- record: namespace:container_memory_usage_bytes:sum },
expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace) {
- record: namespace:container_spec_cpu_shares:sum "expr": "avg(irate(node_disk_io_time_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]))\n",
expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace) "record": ":node_disk_utilisation:avg_irate"
- record: namespace:container_cpu_usage:sum },
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m])) {
BY (namespace) "expr": "avg by (node) (\n irate(node_disk_io_time_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m])\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
- record: cluster:memory_usage:ratio "record": "node:node_disk_utilisation:avg_irate"
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY },
(cluster) / sum(machine_memory_bytes) BY (cluster) {
- record: cluster:container_spec_cpu_shares:ratio "expr": "avg(irate(node_disk_io_time_weighted_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]) / 1e3)\n",
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000 "record": ":node_disk_saturation:avg_irate"
/ sum(machine_cpu_cores) },
- record: cluster:container_cpu_usage:ratio {
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])) "expr": "avg by (node) (\n irate(node_disk_io_time_weighted_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]) / 1e3\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
/ sum(machine_cpu_cores) "record": "node:node_disk_saturation:avg_irate"
- record: apiserver_latency_seconds:quantile },
expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / {
1e+06 "expr": "max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"}\n- node_filesystem_avail_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n/ node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n",
labels: "record": "node:node_filesystem_usage:"
quantile: "0.99" },
- record: apiserver_latency:quantile_seconds {
expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / "expr": "max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"} / node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n",
1e+06 "record": "node:node_filesystem_avail:"
labels: },
quantile: "0.9" {
- record: apiserver_latency_seconds:quantile "expr": "sum(irate(node_network_receive_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m])) +\nsum(irate(node_network_transmit_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n",
expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / "record": ":node_net_utilisation:sum_irate"
1e+06 },
labels: {
quantile: "0.5" "expr": "sum by (node) (\n (irate(node_network_receive_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]) +\n irate(node_network_transmit_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
- alert: APIServerLatencyHigh "record": "node:node_net_utilisation:sum_irate"
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} },
> 1 {
for: 10m "expr": "sum(irate(node_network_receive_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m])) +\nsum(irate(node_network_transmit_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n",
labels: "record": ":node_net_saturation:sum_irate"
severity: warning },
annotations: {
description: the API server has a 99th percentile latency of {{ $value }} seconds "expr": "sum by (node) (\n (irate(node_network_receive_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]) +\n irate(node_network_transmit_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
for {{$labels.verb}} {{$labels.resource}} "record": "node:node_net_saturation:sum_irate"
- alert: APIServerLatencyHigh },
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} {
> 4 "expr": "max(\n max(\n kube_pod_info{job=\"kube-state-metrics\", host_ip!=\"\"}\n ) by (node, host_ip)\n * on (host_ip) group_right (node)\n label_replace(\n (max(node_filesystem_files{job=\"node-exporter\", mountpoint=\"/\"}) by (instance)), \"host_ip\", \"$1\", \"instance\", \"(.*):.*\"\n )\n) by (node)\n",
for: 10m "record": "node:node_inodes_total:"
labels: },
severity: critical {
annotations: "expr": "max(\n max(\n kube_pod_info{job=\"kube-state-metrics\", host_ip!=\"\"}\n ) by (node, host_ip)\n * on (host_ip) group_right (node)\n label_replace(\n (max(node_filesystem_files_free{job=\"node-exporter\", mountpoint=\"/\"}) by (instance)), \"host_ip\", \"$1\", \"instance\", \"(.*):.*\"\n )\n) by (node)\n",
description: the API server has a 99th percentile latency of {{ $value }} seconds "record": "node:node_inodes_free:"
for {{$labels.verb}} {{$labels.resource}} }
- alert: APIServerErrorsHigh ]
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) },
* 100 > 2 {
for: 10m "name": "kubernetes-absent",
labels: "rules": [
severity: warning {
annotations: "alert": "KubeAPIDown",
description: API server returns errors for {{ $value }}% of requests "annotations": {
- alert: APIServerErrorsHigh "message": "KubeAPI has disappeared from Prometheus target discovery.",
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown"
* 100 > 5 },
for: 10m "expr": "absent(up{job=\"apiserver\"} == 1)\n",
labels: "for": "15m",
severity: critical "labels": {
annotations: "severity": "critical"
description: API server returns errors for {{ $value }}% of requests }
- alert: K8SApiserverDown },
expr: absent(up{job="apiserver"} == 1) {
for: 20m "alert": "KubeControllerManagerDown",
labels: "annotations": {
severity: critical "message": "KubeControllerManager has disappeared from Prometheus target discovery.",
annotations: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown"
description: No API servers are reachable or all have disappeared from service },
discovery "expr": "absent(up{job=\"kube-controller-manager\"} == 1)\n",
"for": "15m",
- alert: K8sCertificateExpirationNotice "labels": {
labels: "severity": "critical"
severity: warning }
annotations: },
description: Kubernetes API Certificate is expiring soon (less than 7 days) {
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="604800"}) > 0 "alert": "KubeSchedulerDown",
"annotations": {
- alert: K8sCertificateExpirationNotice "message": "KubeScheduler has disappeared from Prometheus target discovery.",
labels: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown"
severity: critical },
annotations: "expr": "absent(up{job=\"kube-scheduler\"} == 1)\n",
description: Kubernetes API Certificate is expiring in less than 1 day "for": "15m",
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="86400"}) > 0 "labels": {
node.rules.yaml: | "severity": "critical"
groups: }
- name: node.rules },
rules: {
- record: instance:node_cpu:rate:sum "alert": "KubeletDown",
expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m])) "annotations": {
BY (instance) "message": "Kubelet has disappeared from Prometheus target discovery.",
- record: instance:node_filesystem_usage:sum "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown"
expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})) },
BY (instance) "expr": "absent(up{job=\"kubelet\"} == 1)\n",
- record: instance:node_network_receive_bytes:rate:sum "for": "15m",
expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance) "labels": {
- record: instance:node_network_transmit_bytes:rate:sum "severity": "critical"
expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance) }
- record: instance:node_cpu:ratio }
expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance) ]
GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance) },
- record: cluster:node_cpu:sum_rate5m {
expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) "name": "kubernetes-apps",
- record: cluster:node_cpu:ratio "rules": [
expr: cluster:node_cpu:sum_rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu)) {
- alert: NodeExporterDown "alert": "KubePodCrashLooping",
expr: absent(up{job="node-exporter"} == 1) "annotations": {
for: 10m "message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} times / 5 minutes.",
labels: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping"
severity: warning },
annotations: "expr": "rate(kube_pod_container_status_restarts_total{job=\"kube-state-metrics\"}[15m]) * 60 * 5 > 0\n",
description: Prometheus could not scrape a node-exporter for more than 10m, "for": "1h",
or node-exporters have disappeared from discovery "labels": {
- alert: NodeDiskRunningFull "severity": "critical"
expr: predict_linear(node_filesystem_free_bytes[6h], 3600 * 24) < 0 }
for: 30m },
labels: {
severity: warning "alert": "KubePodNotReady",
annotations: "annotations": {
description: device {{$labels.device}} on node {{$labels.instance}} is running "message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.",
full within the next 24 hours (mounted at {{$labels.mountpoint}}) "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready"
- alert: NodeDiskRunningFull },
expr: predict_linear(node_filesystem_free_bytes[30m], 3600 * 2) < 0 "expr": "sum by (namespace, pod) (kube_pod_status_phase{job=\"kube-state-metrics\", phase=~\"Pending|Unknown\"}) > 0\n",
for: 10m "for": "1h",
labels: "labels": {
severity: critical "severity": "critical"
annotations: }
description: device {{$labels.device}} on node {{$labels.instance}} is running },
full within the next 2 hours (mounted at {{$labels.mountpoint}}) {
- alert: InactiveRAIDDisk "alert": "KubeDeploymentGenerationMismatch",
expr: node_md_disks - node_md_disks_active > 0 "annotations": {
for: 10m "message": "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.",
labels: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch"
severity: warning },
annotations: "expr": "kube_deployment_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_deployment_metadata_generation{job=\"kube-state-metrics\"}\n",
description: '{{$value}} RAID disk(s) on node {{$labels.instance}} are inactive' "for": "15m",
prometheus.rules.yaml: | "labels": {
groups: "severity": "critical"
- name: prometheus.rules }
rules: },
- alert: PrometheusConfigReloadFailed {
expr: prometheus_config_last_reload_successful == 0 "alert": "KubeDeploymentReplicasMismatch",
for: 10m "annotations": {
labels: "message": "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.",
severity: warning "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch"
annotations: },
description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}} "expr": "kube_deployment_spec_replicas{job=\"kube-state-metrics\"}\n !=\nkube_deployment_status_replicas_available{job=\"kube-state-metrics\"}\n",
- alert: PrometheusNotificationQueueRunningFull "for": "1h",
expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity "labels": {
for: 10m "severity": "critical"
labels: }
severity: warning },
annotations: {
description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{ "alert": "KubeStatefulSetReplicasMismatch",
$labels.pod}} "annotations": {
- alert: PrometheusErrorSendingAlerts "message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.",
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch"
> 0.01 },
for: 10m "expr": "kube_statefulset_status_replicas_ready{job=\"kube-state-metrics\"}\n !=\nkube_statefulset_status_replicas{job=\"kube-state-metrics\"}\n",
labels: "for": "15m",
severity: warning "labels": {
annotations: "severity": "critical"
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ }
$labels.pod}} to Alertmanager {{$labels.Alertmanager}} },
- alert: PrometheusErrorSendingAlerts {
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) "alert": "KubeStatefulSetGenerationMismatch",
> 0.03 "annotations": {
for: 10m "message": "StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.",
labels: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch"
severity: critical },
annotations: "expr": "kube_statefulset_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_statefulset_metadata_generation{job=\"kube-state-metrics\"}\n",
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ "for": "15m",
$labels.pod}} to Alertmanager {{$labels.Alertmanager}} "labels": {
- alert: PrometheusNotConnectedToAlertmanagers "severity": "critical"
expr: prometheus_notifications_alertmanagers_discovered < 1 }
for: 10m },
labels: {
severity: warning "alert": "KubeStatefulSetUpdateNotRolledOut",
annotations: "annotations": {
description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected "message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.",
to any Alertmanagers "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout"
- alert: PrometheusTSDBReloadsFailing },
expr: increase(prometheus_tsdb_reloads_failures_total[2h]) > 0 "expr": "max without (revision) (\n kube_statefulset_status_current_revision{job=\"kube-state-metrics\"}\n unless\n kube_statefulset_status_update_revision{job=\"kube-state-metrics\"}\n)\n *\n(\n kube_statefulset_replicas{job=\"kube-state-metrics\"}\n !=\n kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}\n)\n",
for: 12h "for": "15m",
labels: "labels": {
severity: warning "severity": "critical"
annotations: }
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} },
reload failures over the last four hours.' {
summary: Prometheus has issues reloading data blocks from disk "alert": "KubeDaemonSetRolloutStuck",
- alert: PrometheusTSDBCompactionsFailing "annotations": {
expr: increase(prometheus_tsdb_compactions_failed_total[2h]) > 0 "message": "Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.",
for: 12h "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck"
labels: },
severity: warning "expr": "kube_daemonset_status_number_ready{job=\"kube-state-metrics\"}\n /\nkube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"} * 100 < 100\n",
annotations: "for": "15m",
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} "labels": {
compaction failures over the last four hours.' "severity": "critical"
summary: Prometheus has issues compacting sample blocks }
- alert: PrometheusTSDBWALCorruptions },
expr: tsdb_wal_corruptions_total > 0 {
for: 4h "alert": "KubeDaemonSetNotScheduled",
labels: "annotations": {
severity: warning "message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.",
annotations: "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled"
description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead },
log (WAL).' "expr": "kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n -\nkube_daemonset_status_current_number_scheduled{job=\"kube-state-metrics\"} > 0\n",
summary: Prometheus write-ahead log is corrupted "for": "10m",
- alert: PrometheusNotIngestingSamples "labels": {
expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0 "severity": "warning"
for: 10m }
labels: },
severity: warning {
annotations: "alert": "KubeDaemonSetMisScheduled",
description: "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples." "annotations": {
summary: "Prometheus isn't ingesting samples" "message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled"
},
"expr": "kube_daemonset_status_number_misscheduled{job=\"kube-state-metrics\"} > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeCronJobRunning",
"annotations": {
"message": "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning"
},
"expr": "time() - kube_cronjob_next_schedule_time{job=\"kube-state-metrics\"} > 3600\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeJobCompletion",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion"
},
"expr": "kube_job_spec_completions{job=\"kube-state-metrics\"} - kube_job_status_succeeded{job=\"kube-state-metrics\"} > 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeJobFailed",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed"
},
"expr": "kube_job_status_failed{job=\"kube-state-metrics\"} > 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "kubernetes-resources",
"rules": [
{
"alert": "KubeCPUOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
},
"expr": "sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)\n /\nsum(node:node_num_cpu:sum)\n >\n(count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeMemOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
},
"expr": "sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)\n /\nsum(node_memory_MemTotal_bytes)\n >\n(count(node:node_num_cpu:sum)-1)\n /\ncount(node:node_num_cpu:sum)\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeCPUOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"requests.cpu\"})\n /\nsum(node:node_num_cpu:sum)\n > 1.5\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeMemOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"requests.memory\"})\n /\nsum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n > 1.5\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeQuotaExceeded",
"annotations": {
"message": "Namespace {{ $labels.namespace }} is using {{ printf \"%0.0f\" $value }}% of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded"
},
"expr": "100 * kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n > 90\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "CPUThrottlingHigh",
"annotations": {
"message": "{{ printf \"%0.0f\" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container_name }} in pod {{ $labels.pod_name }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh"
},
"expr": "100 * sum(increase(container_cpu_cfs_throttled_periods_total{container_name!=\"\", }[5m])) by (container_name, pod_name, namespace)\n /\nsum(increase(container_cpu_cfs_periods_total{}[5m])) by (container_name, pod_name, namespace)\n > 100 \n",
"for": "15m",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "kubernetes-storage",
"rules": [
{
"alert": "KubePersistentVolumeUsageCritical",
"annotations": {
"message": "The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ printf \"%0.2f\" $value }}% free.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeusagecritical"
},
"expr": "100 * kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\nkubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n < 3\n",
"for": "1m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubePersistentVolumeFullInFourDays",
"annotations": {
"message": "Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ printf \"%0.2f\" $value }}% is available.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefullinfourdays"
},
"expr": "100 * (\n kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\n kubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n) < 15\nand\npredict_linear(kubelet_volume_stats_available_bytes{job=\"kubelet\"}[6h], 4 * 24 * 3600) < 0\n",
"for": "5m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubePersistentVolumeErrors",
"annotations": {
"message": "The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors"
},
"expr": "kube_persistentvolume_status_phase{phase=~\"Failed|Pending\",job=\"kube-state-metrics\"} > 0\n",
"for": "5m",
"labels": {
"severity": "critical"
}
}
]
},
{
"name": "kubernetes-system",
"rules": [
{
"alert": "KubeNodeNotReady",
"annotations": {
"message": "{{ $labels.node }} has been unready for more than an hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready"
},
"expr": "kube_node_status_condition{job=\"kube-state-metrics\",condition=\"Ready\",status=\"true\"} == 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeVersionMismatch",
"annotations": {
"message": "There are {{ $value }} different semantic versions of Kubernetes components running.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch"
},
"expr": "count(count by (gitVersion) (label_replace(kubernetes_build_info{job!=\"coredns\"},\"gitVersion\",\"$1\",\"gitVersion\",\"(v[0-9]*.[0-9]*.[0-9]*).*\"))) > 1\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientErrors",
"annotations": {
"message": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf \"%0.0f\" $value }}% errors.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors"
},
"expr": "(sum(rate(rest_client_requests_total{code=~\"5..\"}[5m])) by (instance, job)\n /\nsum(rate(rest_client_requests_total[5m])) by (instance, job))\n* 100 > 1\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientErrors",
"annotations": {
"message": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf \"%0.0f\" $value }} errors / second.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors"
},
"expr": "sum(rate(ksm_scrape_error_total{job=\"kube-state-metrics\"}[5m])) by (instance, job) > 0.1\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletTooManyPods",
"annotations": {
"message": "Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close to the limit of 110.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods"
},
"expr": "kubelet_running_pod_count{job=\"kubelet\"} > 110 * 0.9\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "cluster_quantile:apiserver_request_latencies:histogram_quantile{job=\"apiserver\",quantile=\"0.99\",subresource!=\"log\",verb!~\"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$\"} > 1\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "cluster_quantile:apiserver_request_latencies:histogram_quantile{job=\"apiserver\",quantile=\"0.99\",subresource!=\"log\",verb!~\"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$\"} > 4\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value }}% of requests.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_count{job=\"apiserver\",code=~\"^(?:5..)$\"}[5m])) without(instance, pod)\n /\nsum(rate(apiserver_request_count{job=\"apiserver\"}[5m])) without(instance, pod) * 100 > 10\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value }}% of requests.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_count{job=\"apiserver\",code=~\"^(?:5..)$\"}[5m])) without(instance, pod)\n /\nsum(rate(apiserver_request_count{job=\"apiserver\"}[5m])) without(instance, pod) * 100 > 5\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 7 days.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
},
"expr": "histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 604800\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
},
"expr": "histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 86400\n",
"labels": {
"severity": "critical"
}
}
]
}
]
}
kubeprom.yaml: |-
{
"groups": [
{
"name": "kube-prometheus-node-recording.rules",
"rules": [
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[3m])) BY (instance)",
"record": "instance:node_cpu:rate:sum"
},
{
"expr": "sum((node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_free_bytes{mountpoint=\"/\"})) BY (instance)",
"record": "instance:node_filesystem_usage:sum"
},
{
"expr": "sum(rate(node_network_receive_bytes_total[3m])) BY (instance)",
"record": "instance:node_network_receive_bytes:rate:sum"
},
{
"expr": "sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)",
"record": "instance:node_network_transmit_bytes:rate:sum"
},
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)",
"record": "instance:node_cpu:ratio"
},
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[5m]))",
"record": "cluster:node_cpu:sum_rate5m"
},
{
"expr": "cluster:node_cpu:sum_rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu))",
"record": "cluster:node_cpu:ratio"
}
]
},
{
"name": "kube-prometheus-node-alerting.rules",
"rules": [
{
"alert": "NodeDiskRunningFull",
"annotations": {
"message": "Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours."
},
"expr": "(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)\n",
"for": "30m",
"labels": {
"severity": "warning"
}
},
{
"alert": "NodeDiskRunningFull",
"annotations": {
"message": "Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours."
},
"expr": "(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)\n",
"for": "10m",
"labels": {
"severity": "critical"
}
}
]
},
{
"name": "prometheus.rules",
"rules": [
{
"alert": "PrometheusConfigReloadFailed",
"annotations": {
"description": "Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}",
"summary": "Reloading Prometheus' configuration failed"
},
"expr": "prometheus_config_last_reload_successful{job=\"prometheus\"} == 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusNotificationQueueRunningFull",
"annotations": {
"description": "Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{ $labels.pod}}",
"summary": "Prometheus' alert notification queue is running full"
},
"expr": "predict_linear(prometheus_notifications_queue_length{job=\"prometheus\"}[5m], 60 * 30) > prometheus_notifications_queue_capacity{job=\"prometheus\"}\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusErrorSendingAlerts",
"annotations": {
"description": "Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}",
"summary": "Errors while sending alerts from Prometheus"
},
"expr": "rate(prometheus_notifications_errors_total{job=\"prometheus\"}[5m]) / rate(prometheus_notifications_sent_total{job=\"prometheus\"}[5m]) > 0.01\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusErrorSendingAlerts",
"annotations": {
"description": "Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}",
"summary": "Errors while sending alerts from Prometheus"
},
"expr": "rate(prometheus_notifications_errors_total{job=\"prometheus\"}[5m]) / rate(prometheus_notifications_sent_total{job=\"prometheus\"}[5m]) > 0.03\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "PrometheusNotConnectedToAlertmanagers",
"annotations": {
"description": "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected to any Alertmanagers",
"summary": "Prometheus is not connected to any Alertmanagers"
},
"expr": "prometheus_notifications_alertmanagers_discovered{job=\"prometheus\"} < 1\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBReloadsFailing",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} reload failures over the last two hours.",
"summary": "Prometheus has issues reloading data blocks from disk"
},
"expr": "increase(prometheus_tsdb_reloads_failures_total{job=\"prometheus\"}[2h]) > 0\n",
"for": "12h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBCompactionsFailing",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} compaction failures over the last two hours.",
"summary": "Prometheus has issues compacting sample blocks"
},
"expr": "increase(prometheus_tsdb_compactions_failed_total{job=\"prometheus\"}[2h]) > 0\n",
"for": "12h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBWALCorruptions",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead log (WAL).",
"summary": "Prometheus write-ahead log is corrupted"
},
"expr": "prometheus_tsdb_wal_corruptions_total{job=\"prometheus\"} > 0\n",
"for": "4h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusNotIngestingSamples",
"annotations": {
"description": "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples.",
"summary": "Prometheus isn't ingesting samples"
},
"expr": "rate(prometheus_tsdb_head_samples_appended_total{job=\"prometheus\"}[5m]) <= 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTargetScrapesDuplicate",
"annotations": {
"description": "{{$labels.namespace}}/{{$labels.pod}} has many samples rejected due to duplicate timestamps but different values",
"summary": "Prometheus has many samples rejected"
},
"expr": "increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=\"prometheus\"}[5m]) > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "general.rules",
"rules": [
{
"alert": "TargetDown",
"annotations": {
"message": "{{ $value }}% of the {{ $labels.job }} targets are down."
},
"expr": "100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
}
]
}
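
The threshold logic in the `PrometheusErrorSendingAlerts` and `TargetDown` rules above can be checked offline with a small sketch. The function names below are hypothetical and not part of the rules file. The sketch only mirrors the arithmetic in the `expr` fields; it ignores the `for` duration and reports only the highest matching severity, whereas Prometheus would fire both the warning and the critical alert above the 3% ratio.

```python
import math
from typing import Optional, Sequence


def error_ratio(errors_per_s: float, sent_per_s: float) -> float:
    """Ratio of notification errors to notifications sent.

    Mimics PromQL float division: x/0 is +Inf and 0/0 is NaN.
    """
    if sent_per_s == 0:
        return math.inf if errors_per_s > 0 else math.nan
    return errors_per_s / sent_per_s


def notification_error_severity(errors_per_s: float, sent_per_s: float) -> Optional[str]:
    """Return the highest severity whose threshold the ratio exceeds.

    Thresholds are taken from the two PrometheusErrorSendingAlerts rules:
    > 0.01 is a warning, > 0.03 is critical. NaN compares false, so no alert.
    """
    ratio = error_ratio(errors_per_s, sent_per_s)
    if ratio > 0.03:
        return "critical"
    if ratio > 0.01:
        return "warning"
    return None


def target_down_percent(up_values: Sequence[int]) -> float:
    """Percentage of targets in one job whose `up` sample is 0."""
    down = sum(1 for value in up_values if value == 0)
    return 100 * down / len(up_values)


def target_down_fires(up_values: Sequence[int]) -> bool:
    """True when more than 10% of the job's targets are down."""
    return target_down_percent(up_values) > 10


if __name__ == "__main__":
    print(notification_error_severity(0.02, 1.0))
    print(target_down_percent([1, 1, 0, 1]))
```

A per-job call such as `target_down_fires([1, 1, 0, 1])` corresponds to one `BY (job)` group in the `TargetDown` expression.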

@ -14,7 +14,8 @@ kubectl port-forward grafana-POD-ID 8080 -n monitoring
 Visit [127.0.0.1:8080](http://127.0.0.1:8080) to view the bundled dashboards.
-![Grafana Capacity Planning](../img/grafana-capacity.png)
-![Grafana Control Plane](../img/grafana-control-plane.png)
-![Grafana Node View](../img/grafana-node.png)
+![Grafana etcd](../img/grafana-etcd.png)
+![Grafana resources cluster](../img/grafana-resources-cluster.png)
+![Grafana usage cluster](../img/grafana-usage-cluster.png)
+![Grafana usage node](../img/grafana-usage-node.png)

Binary file not shown. Before: 234 KiB

Binary file not shown. Before: 259 KiB

BIN docs/img/grafana-etcd.png Normal file
Binary file not shown. After: 88 KiB

Binary file not shown. Before: 240 KiB

Binary file not shown. After: 86 KiB

Binary file not shown. After: 92 KiB

Binary file not shown. After: 103 KiB