Commit Graph

64 Commits

Author SHA1 Message Date
Dalton Hubble ec5aef5c92 Refresh Prometheus rules and Grafana dashboards
* Adds several network related alerts from upstream
2019-04-27 22:41:13 -07:00
Dalton Hubble 0e94708fd8 Update kube-state-metrics from v1.5.0 to v1.6.0-rc.2
* Collect metrics for Ingress resources
* Collect metrics about certificates.k8s.io certificatesigningrequests
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.6.0-rc.2
2019-04-27 20:54:40 -07:00
Dalton Hubble 2c11bad439 Update Prometheus from v2.9.1 to v2.9.2
* https://github.com/prometheus/prometheus/releases/tag/v2.9.2
2019-04-27 20:39:55 -07:00
Dalton Hubble f3174c2b7a Update Prometheus from v2.8.1 to v2.9.1
* https://github.com/prometheus/prometheus/releases/tag/v2.9.1
* https://github.com/prometheus/prometheus/releases/tag/v2.9.0
2019-04-18 23:26:32 -07:00
Dalton Hubble 1b157a2fa4 Revert "Update kube-state-metrics from v1.5.0 to v1.6.0-rc.0"
* This reverts commit 6e5d66cf66
* kube-state-metrics v1.6.0-rc.0 fires KubeDeploymentReplicasMismatch
alerts claiming its own Deployment doesn't have replicas available
(kube_deployment_status_replicas_available), even though all replicas
are available according to kubectl inspection
* This problem was present even with the CSR ClusterRole fix
(https://github.com/kubernetes/kube-state-metrics/pull/717)
2019-04-13 12:37:53 -07:00
Dalton Hubble 6e5d66cf66 Update kube-state-metrics from v1.5.0 to v1.6.0-rc.0
* Adds a metrics collector for Ingress resources and other
improvements
* https://github.com/kubernetes/kube-state-metrics/pull/640
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.6.0-rc.0
2019-04-09 22:16:36 -07:00
Dalton Hubble ce78d5988e Refresh Prometheus rules and Grafana dashboards
* Refresh rules and dashboards from upstreams
* Add new Kubernetes "workload" dashboards
  * View pods in a workload (deployment/daemonset/statefulset)
  * View workloads in a namespace
2019-04-06 23:31:44 -07:00
Dalton Hubble 3e9dc28a00 Update Prometheus from v2.8.0 to v2.8.1
* https://github.com/prometheus/prometheus/releases/tag/v2.8.1
2019-03-31 17:40:20 -07:00
Dalton Hubble 41a9d86bc3 Add NetworkPolicy to limit traffic into Prometheus
* Allow traffic from Grafana to Prometheus in monitoring
* Allow traffic from Prometheus to Prometheus in monitoring
* NetworkPolicy denies non-whitelisted traffic. Define additional
policies to allow other access
2019-03-23 21:38:34 -07:00
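As a rough illustration of the NetworkPolicy described in 41a9d86bc3, the sketch below builds a comparable manifest as a Python dictionary. The `name: prometheus` and `name: grafana` pod labels, the policy name, and port 9090 (Prometheus's default listen port) are assumptions for illustration, not values taken from the commit.

```python
import json


def prometheus_network_policy(namespace="monitoring", port=9090):
    """Build a NetworkPolicy that only admits Grafana and Prometheus pods.

    The label values and the policy name are hypothetical; adjust them to
    match the labels actually set on the workloads.
    """
    # A bare podSelector peer matches pods in the policy's own namespace.
    allowed_peers = [
        {"podSelector": {"matchLabels": {"name": "grafana"}}},
        {"podSelector": {"matchLabels": {"name": "prometheus"}}},
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "prometheus", "namespace": namespace},
        "spec": {
            # Pods protected by the policy (assumed label).
            "podSelector": {"matchLabels": {"name": "prometheus"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": allowed_peers,
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(prometheus_network_policy(), indent=2))
```

Any ingress to the selected pods that matches neither peer is dropped, so other consumers of Prometheus would need a policy of their own.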
Dalton Hubble 36e31fc9fa Add liveness and readiness probes to Grafana
* https://github.com/grafana/grafana/issues/3302
2019-03-23 17:55:37 -07:00
Dalton Hubble 6dd2731046 Set cpu/memory resources requests/limits for some addons
* Set resource requests and limits for Grafana and CLUO
* Set resource requests for Prometheus, but allow usage
to grow since needs vary widely
* Leave nginx without resource requests/limits for now,
since it's typically well behaved
2019-03-20 00:15:08 -07:00
Dalton Hubble aa630003a4 Refresh Prometheus rules and Grafana dashboards
* Refresh rules and dashboards from upstreams
* Organize dashboards and stay below the ConfigMap size
limit
2019-03-17 13:23:04 -07:00
Dalton Hubble e0bee2e417 Update Prometheus from v2.7.2 to v2.8.0
* https://github.com/prometheus/prometheus/releases/tag/v2.8.0
2019-03-13 22:11:38 -07:00
Dalton Hubble 4d9a692424 Update Prometheus from v2.7.1 to v2.7.2
* https://github.com/prometheus/prometheus/releases/tag/v2.7.2
2019-03-04 23:08:12 -08:00
Dalton Hubble e483c81ce9 Improve Prometheus rules and alerts and Grafana dashboards
* Collate upstream rules, alerts, and dashboards and tune for use
in Typhoon
* Previously, a well-chosen (but older) set of rules, alerts, and
dashboards was maintained to reflect metric name changes
2019-02-18 12:19:23 -08:00
Dalton Hubble b13a651cfe Drop metrics that are unset, high cardinality, or extraneous
* https://github.com/coreos/prometheus-operator/pull/2387
* https://github.com/coreos/prometheus-operator/pull/1959
2019-02-10 23:56:11 -08:00
Dalton Hubble 9c59f393a5 Add Kubernetes pod name to metrics discovered from service endpoints
* Prometheus queries from some upstreams use joins of node-exporter
and kube-state-metrics metrics by (namespace,pod). Add the Kubernetes
pod name to service endpoint metrics
* Rename the kubernetes_namespace field to namespace
* Honor labels since kube-state-metrics already includes a `pod` field
that should not be overridden
2019-02-10 23:54:30 -08:00
Dalton Hubble 949ce21fb2 Update Prometheus from v2.7.0 to v2.7.1
* https://github.com/prometheus/prometheus/releases/tag/v2.7.1
2019-02-02 00:13:24 -08:00
Dalton Hubble 130daeac26 Update Prometheus from v2.6.1 to v2.7.0 2019-01-29 22:31:20 -08:00
Dalton Hubble f5ff003d0e Update node-exporter from v0.15.2 to v0.17.0
* node-exporter renamed multiple metrics; the renames are reflected
in changes to Prometheus rules and Grafana dashboard expressions
2019-01-22 01:14:00 -08:00
Dalton Hubble d697dd46dc Allow kube-state-metrics PodDisruptionBudget metrics
* Update kube-state-metrics ClusterRole to allow collecting
poddisruptionbudget metrics (exported as kube_poddisruptionbudget_*)
* https://github.com/kubernetes/kube-state-metrics/pull/551
* Bump addon-resizer from v1.7 to v1.8.4
2019-01-22 01:12:32 -08:00
Dalton Hubble 67fb9602e7 Update Prometheus from v2.6.0 to v2.6.1
* https://github.com/prometheus/prometheus/releases/tag/v2.6.1
2019-01-15 21:13:40 -08:00
Dalton Hubble 1d27dc6528 Update kube-state-metrics exporter from v1.4.0 to v1.5.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.5.0
2019-01-12 14:24:57 -08:00
Dalton Hubble ea8b0d1c84 Update Prometheus addon from v2.5.0 to v2.6.0
* https://github.com/prometheus/prometheus/releases/tag/v2.6.0
2018-12-27 07:35:12 -08:00
Dalton Hubble 7de03a1279 Fix Prometheus etcd scrape config for DigitalOcean
* Kubelet uses a node's hostname as the node name, which isn't
resolvable on DigitalOcean. On DigitalOcean, the node name was
set to the internal IP until #337 switched to instead configuring
kube-apiserver to prefer the InternalIP for communication
* Explicitly configure etcd scrapes to target each controller by
internal IP and port 2381 (replace __address__)
2018-11-06 23:02:45 -08:00
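The `__address__` replacement described in 7de03a1279 can be pictured with the small sketch below, which imitates a Prometheus `replace` relabel step in Python. The source label `__meta_kubernetes_node_address_InternalIP` is the node-role discovery label Prometheus provides; whether the commit uses exactly that label is an assumption, and the function name is hypothetical.

```python
import re

ETCD_METRICS_PORT = 2381  # port named in the commit message


def relabel_etcd_address(labels, source_label="__meta_kubernetes_node_address_InternalIP"):
    """Imitate a `replace` relabel step that rewrites __address__.

    Returns a new label dict; the input is never mutated.
    """
    value = labels.get(source_label, "")
    # Prometheus anchors relabel regexes, so match the whole value.
    # This sketch deliberately skips empty values.
    match = re.fullmatch(r"(.+)", value)
    if match is None:
        return dict(labels)
    rewritten = dict(labels)
    rewritten["__address__"] = f"{match.group(1)}:{ETCD_METRICS_PORT}"
    return rewritten


if __name__ == "__main__":
    discovered = {
        "__address__": "controller-0:10250",
        "__meta_kubernetes_node_address_InternalIP": "10.0.0.5",
    }
    print(relabel_etcd_address(discovered)["__address__"])
```

Because the scrape target becomes the internal IP, it no longer depends on the node hostname being resolvable.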
Dalton Hubble be9f7b87d6 Update Prometheus from v2.4.3 to v2.5.0
* https://github.com/prometheus/prometheus/releases/tag/v2.5.0
2018-11-06 22:16:12 -08:00
Dalton Hubble a10d6977b8 Update Prometheus from v2.4.2 to v2.4.3
* https://github.com/prometheus/prometheus/releases/tag/v2.4.3
2018-10-16 21:29:41 -07:00
Dalton Hubble 5eb4078d68 Add docker/default seccomp to control plane and addons
* Annotate pods, deployments, and daemonsets to start containers
with the Docker runtime's default seccomp profile
* Overrides Kubernetes default behavior which started containers
with seccomp=unconfined
* https://docs.docker.com/engine/security/seccomp/#pass-a-profile-for-a-container
2018-10-16 20:07:29 -07:00
Dalton Hubble 032a24133b Update Prometheus from v2.3.2 to v2.4.2
* https://github.com/prometheus/prometheus/releases/tag/v2.4.0
* https://github.com/prometheus/prometheus/releases/tag/v2.4.1
* https://github.com/prometheus/prometheus/releases/tag/v2.4.2
2018-09-21 22:27:11 -07:00
Dalton Hubble 4ba090feb0 Update kube-state-metrics from v1.3.1 to v1.4.0 2018-08-29 09:37:50 -07:00
Becca Powell 49a9dc9b8b Fix typo in Prometheus alerting rules 2018-08-21 16:55:49 -07:00
Dalton Hubble 02cd8eb8d3 Update Prometheus from v2.3.1 to v2.3.2
* https://github.com/prometheus/prometheus/releases/tag/v2.3.2
2018-07-14 14:25:49 -07:00
Dalton Hubble 84d6cfe7b3 Add Prometheus alert rule for inactive md devices
* node-exporter exposes metrics to Prometheus about total and
active md devices (e.g. disks in mdadm RAID arrays)
* Add alert that fires when a RAID disk fails or becomes inactive
for another reason
2018-07-10 00:20:30 -07:00
Dalton Hubble 05b99178ae Update Prometheus from v2.3.0 to v2.3.1
* https://github.com/prometheus/prometheus/releases/tag/v2.3.1
2018-06-19 21:43:50 -07:00
Dalton Hubble cbe646fba6 Label namespaces to ease writing Network Policies 2018-06-09 11:45:11 -07:00
Dalton Hubble c166b2ba33 Update Prometheus from v2.2.1 to v2.3.0 2018-06-09 11:43:10 -07:00
Dalton Hubble 32a9a83190 Add Prometheus liveness and readiness probes 2018-05-30 22:34:07 -07:00
Dalton Hubble c2b719dc75 Configure Prometheus to scrape Kubelets directly
* Use Kubelet bearer token authn/authz to scrape metrics
* Reduce RBAC permission from nodes/proxy to nodes/metrics
* Stop proxying kubelet scrapes through the apiserver, since
this required higher privilege (nodes/proxy) and can add
load to the apiserver on large clusters
2018-05-14 23:06:50 -07:00
Dalton Hubble a54e3c0da1 Fix Prometheus data dir to /var/lib/prometheus
* A data volume (emptyDir) is mounted to /var/lib/prometheus
* Users could swap emptyDir for any desired volume if data
persistence is desired. Prometheus previously defaulted to
keeping its data in ./data relative to /prometheus. Override
this behavior to store data in /var/lib/prometheus
2018-05-01 22:05:27 -07:00
Dalton Hubble 9789881243 Update kube-state-metrics from v1.3.0 to v1.3.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.3.1
2018-04-15 17:10:02 -07:00
Dalton Hubble 6b08bde479 Use k8s.gcr.io instead of gcr.io/google_containers
* Kubernetes recommends using the alias to fetch images
from the nearest GCR regional mirror, to abstract the use
of GCR, and to drop names containing 'google'
* https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ
2018-04-08 12:57:52 -07:00
Dalton Hubble f4b2396718 Return Prometheus deployment to be a worker workload
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to https://github.com/poseidon/typhoon/pull/175
2018-04-08 12:20:00 -07:00
Dalton Hubble 7186aa46da Update kube-state-metrics from v1.2.0 to v1.3.0
* https://github.com/kubernetes/kube-state-metrics/pull/412
* https://github.com/kubernetes/kube-state-metrics/pull/413
2018-04-04 21:04:13 -07:00
Dalton Hubble d770393dbc Add etcd metrics, Prometheus scrapes, and Grafana dash
* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
2018-04-03 20:31:00 -07:00
Dalton Hubble 46226a8015 Update Prometheus from 2.2.0 to 2.2.1 2018-03-18 15:56:44 -07:00
Dalton Hubble 42708f9a70 Update Prometheus from v2.2.0-rc.1 to v2.2.0
* https://github.com/prometheus/prometheus/releases/tag/v2.2.0
2018-03-09 00:20:40 -08:00
Dalton Hubble 9307e97c46 addons: Update Prometheus from v2.1.0 to v2.2.0
* Annotate Prometheus service to scrape metrics from
Prometheus itself (enables Prometheus* alerts)
* Update kube-state-metrics addon-resizer to 1.7
* Use port 8080 for kube-state-metrics
* Add PrometheusNotIngestingSamples alert rule
* Change K8SKubeletDown alert rule to fire when 10%
of kubelets are down, not 1%
  * https://github.com/coreos/prometheus-operator/pull/1032
2018-03-09 00:20:40 -08:00
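To make the K8SKubeletDown threshold change in 9307e97c46 concrete, the sketch below reproduces only the arithmetic: the alert condition holds when the fraction of kubelet targets that are down exceeds a threshold. The function name and sample data are hypothetical, and the actual upstream rule expression is not reproduced here.

```python
def kubelet_down_alert_fires(up_samples, threshold=0.10):
    """Return True when the fraction of down kubelets exceeds threshold.

    up_samples holds one value per kubelet target: 1 if the last scrape
    succeeded, 0 if it failed (the semantics of Prometheus's `up` metric).
    """
    if not up_samples:
        return False  # no targets discovered, nothing to compare
    down = sum(1 for value in up_samples if value == 0)
    return down / len(up_samples) > threshold


if __name__ == "__main__":
    samples = [1] * 19 + [0]  # 1 of 20 kubelets down (5%)
    print(kubelet_down_alert_fires(samples, threshold=0.01))  # old 1% threshold
    print(kubelet_down_alert_fires(samples))                  # new 10% threshold
```

With 20 nodes, a single unreachable kubelet crosses the 1% threshold but not the 10% one, which is the kind of noise the change reduces.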
Paul Saunders 86420fd507 Rename namespace manifests to be applied first
* Ensure kubectl apply -R creates manifests in the right order
2018-02-22 01:04:30 -08:00
Dalton Hubble 2c10d24113 addons: Switch to apps/v1 workload APIs
* Deployments now belong to the apps/v1 API group
* DaemonSets now belong to the apps/v1 API group
* RBAC types now belong to the rbac.authorization.k8s.io/v1 API group
2018-02-10 23:56:31 -08:00
Dalton Hubble 064ce83f25 addons: Update Prometheus to v2.1.0
* Change service discovery to relabel jobs to align with
rule expressions in upstream examples
* Use a separate service account for prometheus instead
of granting roles to the namespace's default
* Use a separate service account for node-exporter
* Update node-exporter and kube-state-metrics exporters
2018-01-27 21:00:15 -08:00