..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alert Infrastructure
===================================

Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether OPs staff of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
collected and graphed, and can trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------
An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
to Prometheus format and export them.
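
To illustrate the text exposition format that an exporter serves, the following is a minimal
sketch using only the Python standard library. The metric name ``aether_requests_total`` and
port ``9100`` are hypothetical; a real component would normally use one of the official client
libraries rather than rendering the format by hand.

.. code-block:: python

   # Minimal sketch of a Prometheus exporter (hypothetical metric name).
   from http.server import BaseHTTPRequestHandler, HTTPServer

   REQUEST_COUNT = 0  # the value the component wants to expose


   def render_metrics() -> str:
       """Render current metric values in the Prometheus text exposition format."""
       return (
           "# HELP aether_requests_total Total requests handled by the component.\n"
           "# TYPE aether_requests_total counter\n"
           f"aether_requests_total {REQUEST_COUNT}\n"
       )


   class MetricsHandler(BaseHTTPRequestHandler):
       def do_GET(self):
           if self.path == "/metrics":
               body = render_metrics().encode()
               self.send_response(200)
               self.send_header("Content-Type", "text/plain; version=0.0.4")
               self.send_header("Content-Length", str(len(body)))
               self.end_headers()
               self.wfile.write(body)
           else:
               self.send_response(404)
               self.end_headers()


   if __name__ == "__main__":
       # Prometheus would be configured to scrape http://<host>:9100/metrics
       HTTPServer(("", 9100), MetricsHandler).serve_forever()

A client library hides these details behind counter and gauge objects, but the wire format is
exactly this: optional ``# HELP`` and ``# TYPE`` comments followed by ``name value`` lines.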

A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.
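
A ServiceMonitor might look like the following sketch; the component name, label selector, and
port name are placeholders, and the selector must match the labels on the component's Service:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: my-component
   spec:
     selector:
       matchLabels:
         app: my-component        # must match the Service's labels
     endpoints:
       - port: metrics            # named port on the Service serving /metrics
         interval: 30s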


Working with Grafana Dashboards
--------------------------------
Once the local cluster's Prometheus is collecting a component's metrics, they can be visualized using Grafana
dashboards. The Grafana instance running on the AMP cluster is able to send queries to the Prometheus
servers running on all Aether clusters. This means that component metrics can be visualized on the AMP Grafana
regardless of where the component is actually running.

To create a new Grafana dashboard or modify an existing one, first log in to the AMP Grafana using an account
with admin privileges. To add a new dashboard, click the **+** at left. To make a copy of an existing dashboard for
editing, click the **Dashboard Settings** icon (gear icon) at the upper right of the existing dashboard, and then
click the **Save as…** button at left.

Next, add panels to the dashboard. Since Grafana can access Prometheus on all the clusters in the environment,
each cluster is available as a data source. For example, when adding a panel showing metrics collected on the
ace-menlo cluster, choose ace-menlo as the data source.

Clicking the floppy disk icon at top will save the dashboard *temporarily* (the dashboard is not
saved to persistent storage and is deleted as soon as Grafana is restarted). To save the dashboard *permanently*,
click the **Share Dashboard** icon next to the title and save its JSON to a file. Then add the file to the
aether-app-configs repository so that it will be deployed by Fleet:

* Change to the directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<amp-cluster>/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configmapGenerator``
* Commit the changes and submit a patchset to Gerrit
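
The ``configmapGenerator`` entry added in the third step might look like the following sketch;
the dashboard file name is a placeholder, and the exact labels expected on the generated
ConfigMap depend on how the Grafana sidecar is configured in this deployment:

.. code-block:: yaml

   configMapGenerator:
     - name: my-component-dashboard
       files:
         - dashboards/my-component-dashboard.json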

Once the patchset is merged, the AMP Grafana will automatically detect and deploy the new dashboard.

Adding Service-specific Alerts
------------------------------
An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address or Slack channel).

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if an expression is true for the time indicated, the corresponding alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
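
A PrometheusRule might look like the following sketch; the rule name, metric, threshold, and
durations are all hypothetical and would be chosen to suit the component:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-component-rules
   spec:
     groups:
       - name: my-component
         rules:
           - alert: MyComponentHighErrorRate
             # Fires if the error rate stays above 0.1/s for 10 minutes
             expr: rate(aether_request_errors_total[5m]) > 0.1
             for: 10m
             labels:
               severity: warning
             annotations:
               summary: "Sustained high error rate in my-component"

The ``for`` clause is what implements "true for the time indicated": the expression must hold
continuously for that duration before the alert transitions from pending to firing.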

The Alertmanager is configured to send alerts with *critical* or *warning* severity to e-mail and Slack channels
monitored by Aether OPs staff. If it is desirable to route a specific alert to a different receiver
(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to the directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` that defines the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

  .. code-block:: shell

     $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit a patchset to Gerrit
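
The ``receivers`` and ``route`` changes made in the second step might look like the following
sketch, which routes one alert to a component-specific Slack channel; the receiver name,
webhook URL, channel, and alert name are placeholders:

.. code-block:: yaml

   receivers:
     - name: my-component-slack
       slack_configs:
         - api_url: https://hooks.slack.com/services/...   # placeholder webhook
           channel: '#my-component-alerts'
   route:
     routes:
       - match:
           alertname: MyComponentHighErrorRate
         receiver: my-component-slack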

Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the *sealed-secrets-controller* pod running in the *kube-system* namespace on the cluster.