typhoon

Commit Graph

Author	SHA1	Message	Date
Dalton Hubble	c4683c5bad	Refresh Prometheus alerts and Grafana dashboards * Add 2 min wait before KubeNodeUnreachable to be less noisy on preemptible clusters * Add a BlackboxProbeFailure alert for any failing probes for services annotated `prometheus.io/probe: true`	2020-03-02 20:08:37 -08:00
Dalton Hubble	f4d260645c	Update node-exporter from v0.18.1 to v1.0.0-rc.0 * Update mdadm alert rule; node-exporter adds `state` label to `node_md_disks` and removes `node_md_disks_active` * https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.0	2020-02-25 22:29:52 -08:00
Dalton Hubble	525ae23305	Add node-exporter alerts and Grafana dashboard * Add Prometheus alerts from node-exporter * Add Grafana dashboard nodes.json, from node-exporter * Not adding recording rules, since those are only used by some node-exporter USE dashboards not being included	2019-11-16 13:47:20 -08:00
Dalton Hubble	d4573092b5	Improve Kubelet and Compute Resource dashboards * Add cluster filter to Kubelet dashboard * Add network details in resource dashboards * https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/275 * https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/284 * https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/285	2019-10-28 02:22:15 -07:00
Dalton Hubble	68da420adc	Refresh Prometheus rules/alerts and Grafana dashboards * Update Prometheus rules/alerts and Grafana dashboards * Remove dashboards that were moved to node-exporter, they may be added back later if valuable * Remove kube-prometheus based rules/alerts (ClockSkew alert)	2019-10-19 17:43:47 -07:00
Dalton Hubble	68d8717924	Refresh Prometheus rules/alerts and Grafana dashboards * Refresh rules, alerts, and dashboards from upstreams	2019-07-21 11:29:34 -07:00
Dalton Hubble	ec5aef5c92	Refresh Prometheus rules and Grafana dashboards * Adds several network related alerts from upstream	2019-04-27 22:41:13 -07:00
Dalton Hubble	ce78d5988e	Refresh Prometheus rules and Grafana dashboards * Refresh rules and dashboards from upstreams * Add new Kubernetes "workload" dashboards * View pods in a workload (deployment/daemonset/statefulset) * View workloads in a namespace	2019-04-06 23:31:44 -07:00
Dalton Hubble	aa630003a4	Refresh Prometheus rules and Grafana dashboards * Refresh rules and dashboards from upstreams * Organize dashboards and stay below the ConfigMap size limit	2019-03-17 13:23:04 -07:00
Dalton Hubble	e483c81ce9	Improve Prometheus rules and alerts and Grafana dashboards * Collate upstream rules, alerts, and dashboards and tune for use in Typhoon * Previously, a well-chosen (but older) set of rules, alerts, and dashboards were maintained to reflect metric name changes	2019-02-18 12:19:23 -08:00
Dalton Hubble	f5ff003d0e	Update node-exporter from v0.15.2 to v0.17.0 * node-exporter renamed multiple metrics that are reflected in changes to Prometheus rules and Grafana dashboard expressions	2019-01-22 01:14:00 -08:00
Becca Powell	49a9dc9b8b	Fix typo in Prometheus alerting rules	2018-08-21 16:55:49 -07:00
Dalton Hubble	84d6cfe7b3	Add Prometheus alert rule for inactive md devices * node-exporter exposes metrics to Prometheus about total and active md devices (e.g. disks in mdadm RAID arrays) * Add alert that fires when a RAID disk fails or becomes inactive for another reason	2018-07-10 00:20:30 -07:00
Dalton Hubble	d770393dbc	Add etcd metrics, Prometheus scrapes, and Grafana dash * Use etcd v3.3 --listen-metrics-urls to expose only metrics data via http://0.0.0.0:2381 on controllers * Add Prometheus discovery for etcd peers on controller nodes * Temporarily drop two noisy Prometheus alerts	2018-04-03 20:31:00 -07:00
Dalton Hubble	9307e97c46	addons: Update Prometheus from v2.1.0 to v2.2.0 * Annotate Prometheus service to scrape metrics from Prometheus itself (enables Prometheus* alerts) * Update kube-state-metrics addon-resizer to 1.7 * Use port 8080 for kube-state-metrics * Add PrometheusNotIngestingSamples alert rule * Change K8SKubeletDown alert rule to fire when 10% of kubelets are down, not 1% * https://github.com/coreos/prometheus-operator/pull/1032	2018-03-09 00:20:40 -08:00
Dalton Hubble	064ce83f25	addons: Update Prometheus to v2.1.0 * Change service discovery to relabel jobs to align with rule expressions in upstream examples * Use a separate service account for prometheus instead of granting roles to the namespace's default * Use a separate service account for node-exporter * Update node-exporter and kube-state-metrics exporters	2018-01-27 21:00:15 -08:00
Dalton Hubble	65f006e6cc	addons: Sync prometheus alerts to upstream * https://github.com/coreos/prometheus-operator/pull/774	2017-12-01 23:24:08 -08:00
Dalton Hubble	63ab117205	addons: Add prometheus rules for DaemonSets * https://github.com/coreos/prometheus-operator/pull/755	2017-11-16 23:51:21 -08:00
Dalton Hubble	1cd262e712	addons: Fix prometheus K8SApiServerLatency alert rule * https://github.com/coreos/prometheus-operator/issues/751	2017-11-16 23:37:15 -08:00
Dalton Hubble	159443bae7	addons: Add better alerting rules to Prometheus manifests * Adapt the coreos/prometheus-operator alerting rules for Typhoon, https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus/manifests * Add controller manager and scheduler shim services to let prometheus discover them via service endpoints * Fix several alert rules to use service endpoint discovery * A few rules still don't do much, but they default to green	2017-11-10 20:57:47 -08:00
Dalton Hubble	d046d45769	addons: Include Prometheus and node-exporter manifests	2017-10-24 22:58:59 -07:00

21 Commits