Commit Graph

13 Commits

Author SHA1 Message Date
Dalton Hubble 979c092ef6 Reduce apiserver metrics cardinality of non-core APIs
* Reduce `apiserver_request_duration_seconds_count` cardinality
by dropping series for non-core Kubernetes APIs. This is done
to match the `apiserver_request_duration_seconds_bucket` relabeling (sketched below)
* These two relabels must be performed the same way to avoid
affecting new SLO calculations (upcoming)
* See https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/498

Related: https://github.com/poseidon/typhoon/pull/596
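
A minimal sketch of the kind of metric relabel described above, assuming the usual
`metric_relabel_configs` on the apiserver scrape job (not the exact Typhoon manifest):

```yaml
metric_relabel_configs:
  # Drop request duration series that carry a non-empty "group" label
  # (non-core APIs), relabeling the _bucket and _count metrics identically.
  - source_labels: [__name__, group]
    regex: apiserver_request_duration_seconds_(bucket|count);.+
    action: drop
```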
2020-09-13 14:47:49 -07:00
Dalton Hubble eb093af9ed Drop Kubelet labelmap relabel for node_name
* Originally, Kubelet and cAdvisor metrics used a labelmap
relabel to add Kubernetes SD node labels onto time series
* With https://github.com/poseidon/typhoon/pull/596 that
relabel was dropped since node labels aren't usually that
valuable. `__meta_kubernetes_node_name` was retained, but the
resulting field is empty
* Favor just using Prometheus server-side `instance` in
queries that require some node identifier for aggregation
or debugging (see the sketch below)

Fix https://github.com/poseidon/typhoon/issues/823
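
A hypothetical sketch of the kind of relabel being dropped here (the exact
relabel form in Typhoon's manifest may differ):

```yaml
relabel_configs:
  # The rule below illustrates the kind of relabel this commit removes:
  # mapping the SD node name onto a node label that ended up empty (issue 823).
  - source_labels: [__meta_kubernetes_node_name]
    target_label: node_name
  # Afterwards, aggregate by the server-side `instance` label in queries, e.g.
  #   sum by (instance) (rate(container_cpu_usage_seconds_total[5m]))
```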
2020-09-12 19:40:00 -07:00
Dalton Hubble 178afe4a9b Reduce apiserver metrics cardinality and extraneous labels
* Stop mapping node labels to targets discovered via Kubernetes
nodes (e.g. etcd, kubelet, cadvisor). It is rarely useful to
store node labels (e.g. kubernetes.io/os=linux) on these metrics
* kube-apiserver's apiserver_request_duration_seconds_bucket metric
has a high cardinality that includes labels for the API group, verb,
scope, resource, and component for each object type, including for
each CRD. This one metric has ~10k time series in a typical cluster
(between 10% and 40% of the total)
* Removing the apiserver request duration metric outright would make latency
alerts a no-op and break a Grafana apiserver panel. Instead, drop series
that have a "group" label. Effectively, only request durations for
core Kubernetes APIs will be kept (i.e. cardinality won't grow with
each CRD added), as sketched below. This reduces the metric to ~2k unique series
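
A minimal sketch of that drop rule, assuming `metric_relabel_configs` on the
apiserver scrape job (not the exact manifest):

```yaml
metric_relabel_configs:
  # Keep only core-API request durations: series with a non-empty "group"
  # label (APIs added per group, including CRDs) are dropped.
  - source_labels: [__name__, group]
    regex: apiserver_request_duration_seconds_bucket;.+
    action: drop
```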
2019-12-08 22:48:25 -08:00
Dalton Hubble 19de38b30d Fix Prometheus etcd metrics scraping
* Prometheus was configured to use kubernetes discovery
of etcd targets based on nodes matching the node label
node-role.kubernetes.io/controller=true
* Kubernetes v1.16 stopped permitting node role labels
node-role.kubernetes.io/* so Typhoon renamed these labels
(no longer any association with roles) to
node.kubernetes.io/controller=true
* As a result, Prometheus didn't discover etcd targets,
etcd metrics were missing, etcd alerts were ineffective,
and the etcd Grafana dashboard was empty (see the sketch below)
* Introduced: https://github.com/poseidon/typhoon/pull/543
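
The fix amounts to keying etcd discovery off the renamed label; a sketch of the
keep relabel, assuming standard Prometheus meta-label sanitization:

```yaml
relabel_configs:
  # Keep only controller nodes; the key node.kubernetes.io/controller is
  # sanitized to node_kubernetes_io_controller in the SD meta label.
  - source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_controller]
    regex: "true"
    action: keep
```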
2019-10-03 19:07:05 -07:00
Dalton Hubble b13a651cfe Drop metrics that are unset, high cardinality, or extraneous
* https://github.com/coreos/prometheus-operator/pull/2387
* https://github.com/coreos/prometheus-operator/pull/1959
2019-02-10 23:56:11 -08:00
Dalton Hubble 9c59f393a5 Add Kubernetes pod name to metrics discovered from service endpoints
* Prometheus queries from some upstreams use joins of node-exporter
and kube-state-metrics metrics by (namespace,pod). Add the Kubernetes
pod name to service endpoint metrics
* Rename the `kubernetes_namespace` field to `namespace`
* Honor labels since kube-state-metrics already includes a `pod` field
that should not be overridden (see the sketch below)
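
A sketch of the endpoints scrape settings this describes (field names follow
standard Prometheus configuration; the job name is illustrative):

```yaml
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
    - role: endpoints
  # Labels already set by the target (e.g. kube-state-metrics' `pod`) take
  # precedence over the target labels assigned below.
  honor_labels: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
```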
2019-02-10 23:54:30 -08:00
Dalton Hubble 7de03a1279 Fix Prometheus etcd scrape config for DigitalOcean
* Kubelet uses a node's hostname as the node name, which isn't
resolvable on DigitalOcean. There, the node name was set to the
internal IP until #337 switched to configuring kube-apiserver to
prefer the InternalIP for communication instead
* Explicitly configure etcd scrapes to target each controller by
internal IP and port 2381, replacing `__address__` (see the sketch below)
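
A sketch of that explicit targeting, assuming node-role discovery and the
`InternalIP` node address meta label:

```yaml
relabel_configs:
  # Point etcd scrapes at each controller's internal IP on port 2381
  # rather than the node hostname, which DigitalOcean cannot resolve.
  - source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: (.+)
    replacement: "${1}:2381"
    target_label: __address__
    action: replace
```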
2018-11-06 23:02:45 -08:00
Dalton Hubble c2b719dc75 Configure Prometheus to scrape Kubelets directly
* Use Kubelet bearer token authn/authz to scrape metrics (see the sketch below)
* Drop RBAC permission from nodes/proxy to nodes/metrics
* Stop proxying kubelet scrapes through the apiserver, since
this required higher privilege (nodes/proxy) and can add
load to the apiserver on large clusters
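
A minimal sketch of a direct Kubelet scrape job with bearer token auth (paths
and TLS settings are typical defaults, not necessarily Typhoon's exact values):

```yaml
- job_name: kubelet
  kubernetes_sd_configs:
    - role: node
  scheme: https
  # Authenticate with the service account token; the Kubelet authorizes
  # scrapes against nodes/metrics, so nodes/proxy RBAC is no longer needed.
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    # Kubelet serving certificates are commonly self-signed
    insecure_skip_verify: true
```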
2018-05-14 23:06:50 -07:00
Dalton Hubble d770393dbc Add etcd metrics, Prometheus scrapes, and Grafana dash
* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
2018-04-03 20:31:00 -07:00
Dalton Hubble 064ce83f25 addons: Update Prometheus to v2.1.0
* Change service discovery to relabel jobs to align with
rule expressions in upstream examples (see the sketch below)
* Use a separate service account for prometheus instead
of granting roles to the namespace's default
* Use a separate service account for node-exporter
* Update node-exporter and kube-state-metrics exporters
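
A sketch of the kind of job relabel meant here; the source label is an
assumption chosen for illustration, not necessarily the exact config:

```yaml
relabel_configs:
  # Derive the `job` label from the discovered service name so it matches
  # the job names assumed by upstream rule expressions.
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    action: replace
```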
2018-01-27 21:00:15 -08:00
Dalton Hubble f570af9418 addons: Update from Prometheus v1.8.2 to v2.0.0 2017-11-08 22:48:23 -08:00
Dalton Hubble b608f9c615 addons: Use service endpoints to scrape node-exporter 2017-10-24 22:59:00 -07:00
Dalton Hubble d046d45769 addons: Include Prometheus and node-exporter manifests 2017-10-24 22:58:59 -07:00