..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alert Development
================================

Aether uses `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether operators of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the monitoring stack so that its metrics can be
collected, graphed, and used to trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------

An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
into Prometheus format and export them.

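As a rough illustration of the text format an exporter serves, the sketch below prints one gauge
in the Prometheus exposition format. The function name ``render_metrics`` and the metric
``mycomponent_active_sessions`` are hypothetical; a real component would normally rely on one of
the client libraries rather than hand-written output.

.. code-block:: shell

   # Print a single gauge in the Prometheus text exposition format.
   # The metric name and help text are placeholders, not real Aether metrics.
   render_metrics() {
     active_sessions="$1"
     cat <<EOF
   # HELP mycomponent_active_sessions Number of sessions currently active.
   # TYPE mycomponent_active_sessions gauge
   mycomponent_active_sessions ${active_sessions}
   EOF
   }

   # Example: emit the metric with a sample value.
   render_metrics 42
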
A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.


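A minimal ServiceMonitor might look like the following sketch, written to a file from the shell.
The resource name, the ``app: my-component`` labels, and the ``metrics`` port name are assumptions
made for illustration; they must be adapted to the component's actual Service.

.. code-block:: shell

   # Write a ServiceMonitor manifest; all names and labels are placeholders.
   cat > servicemonitor.yaml <<'EOF'
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: my-component
     labels:
       app: my-component
   spec:
     selector:
       matchLabels:
         app: my-component      # must match the labels on the component's Service
     endpoints:
       - port: metrics          # name of the Service port that serves the exporter
         path: /metrics
         interval: 30s
   EOF

   # Apply it to the cluster (requires cluster access):
   # kubectl apply -f servicemonitor.yaml

Depending on how the cluster's Prometheus is configured, the ServiceMonitor may also need a label
that Prometheus is set up to select on.
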
Working with Grafana Dashboards
--------------------------------

Once the local cluster's Prometheus is collecting a component's
metrics, they can be visualized using Grafana dashboards. The Grafana
instance running on the AMP cluster is able to send queries to the
Prometheus servers running on all Aether clusters. This means that
component metrics can be visualized on the AMP Grafana regardless of
where the component is actually running.

To create a new Grafana dashboard or modify an existing one,
first log in to the AMP Grafana using an account with admin privileges.
To add a new dashboard, click the + at left. To copy an
existing dashboard for editing, click the Dashboard Settings icon
(gear icon) at the upper right of the existing dashboard, and then click
the Save as… button at left.

Next, add panels to the dashboard. Since Grafana can access
Prometheus on all the clusters in the environment, each cluster is
available as a data source. For example, when adding a panel showing
metrics collected on the ace-menlo cluster, choose ace-menlo as the
data source.

Clicking the floppy disk icon at the top saves the dashboard
temporarily (the dashboard is not saved to persistent storage and is
deleted as soon as Grafana is restarted). To save the dashboard
permanently, click the Share Dashboard icon next to the title
and save its JSON to a file. Then add the file to the
AMP submodule of OnRamp so that it will be deployed by Ansible:

* Change to directory ``aether-onramp/deps/amp/roles/monitor-load/templates/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configMapGenerator``

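The new entry in ``kustomization.yaml`` might resemble the fragment below (kustomize spells the
field ``configMapGenerator``). The ConfigMap name and the dashboard file name are hypothetical, and
the existing entries in the file are the best guide to the naming pattern actually used.

.. code-block:: shell

   # Write an example fragment to merge into kustomization.yaml by hand.
   # The ConfigMap name and JSON file name are placeholders.
   cat > kustomization-fragment.yaml <<'EOF'
   configMapGenerator:
     - name: grafana-dashboard-my-component
       files:
         - dashboards/my-component.json
   EOF
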
Adding Service-specific Alerts
------------------------------

An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address
or Slack channel).

.. note:: This section on alerts is specific to an operational
   instantiation of Aether that is not supported. A port of this
   capability to Aether OnRamp (so it is available to anyone
   who wants to operate Aether) is pending.

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if the expression is true for the time indicated, then the alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.

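The sketch below writes a PrometheusRule with a single alerting rule. The rule name, the metric
``mycomponent_active_sessions``, the threshold, and the ``5m`` duration are all assumptions chosen
for illustration; only the overall resource shape is the point.

.. code-block:: shell

   # Write a PrometheusRule manifest; names, metric, and threshold are placeholders.
   cat > prometheusrule.yaml <<'EOF'
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-component-rules
   spec:
     groups:
       - name: my-component.rules
         rules:
           - alert: MyComponentTooManySessions
             expr: mycomponent_active_sessions > 1000   # condition to evaluate
             for: 5m                                    # must hold this long before firing
             labels:
               severity: warning
             annotations:
               summary: my-component has had more than 1000 active sessions for 5 minutes
   EOF

   # Apply it to the cluster (requires cluster access):
   # kubectl apply -f prometheusrule.yaml
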
The Alertmanager is configured to send alerts with critical or warning severity to email and Slack channels
monitored by Aether operations staff. To route a specific alert to a different receiver
(e.g., a component-specific Slack channel), change the Alertmanager configuration, which is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` to define the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

.. code-block:: shell

   $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit the patchset to Gerrit

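The editing, encoding, and Secret-definition steps can be sketched as follows; run it in a scratch
directory, since it overwrites ``alertmanager-config.yaml``. The receiver names, the Slack channel
and webhook URL, the alert name, and the Secret's name, namespace, and data key are all assumptions
and must be replaced with the values used in the real overlay. The ``matchers`` syntax assumes a
reasonably recent Alertmanager.

.. code-block:: shell

   # 1. Example Alertmanager configuration with an extra receiver (placeholder values).
   cat > alertmanager-config.yaml <<'EOF'
   route:
     receiver: default
     routes:
       - matchers:
           - alertname = "MyComponentTooManySessions"
         receiver: my-component-slack
   receivers:
     - name: default
     - name: my-component-slack
       slack_configs:
         - channel: '#my-component-alerts'
           api_url: https://hooks.slack.com/services/REPLACE_ME
   EOF

   # 2. Base64-encode the file as a single line.
   ENCODED="$(base64 < alertmanager-config.yaml | tr -d '\n')"

   # 3. Define the Secret; its name, namespace, and data key are assumptions.
   cat > alertmanager-config-secret.yaml <<EOF
   apiVersion: v1
   kind: Secret
   metadata:
     name: alertmanager-config
     namespace: cattle-monitoring-system
   type: Opaque
   data:
     alertmanager.yaml: ${ENCODED}
   EOF

The resulting ``alertmanager-config-secret.yaml`` is the input to the ``kubeseal`` command.
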
Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the sealed-secrets-controller pod running on the cluster in the kube-system namespace.

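
One possible way to run this check from the command line is sketched below. The Deployment name
``sealed-secrets-controller`` is inferred from the pod name mentioned above, and the helper function
names and their arguments are hypothetical.

.. code-block:: shell

   # Show recent logs from the sealed-secrets controller.
   # Assumes the controller runs as a Deployment named sealed-secrets-controller.
   show_unseal_logs() {
     kubectl -n kube-system logs deployment/sealed-secrets-controller --tail=50
   }

   # Confirm that the unsealed Secret exists.
   # $1 is the namespace and $2 the Secret name; both depend on the deployment.
   check_secret() {
     kubectl -n "$1" get secret "$2"
   }

   # Usage (requires cluster access):
   # show_unseal_logs
   # check_secret <namespace> <secret-name>
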