..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alert Development
================================

Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether operators of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
collected and graphed, and can trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------

An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
into Prometheus format and export them.
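
To make the text format concrete, the following sketch prints what a scrape of an
exporter's HTTP endpoint might return for a single gauge. The metric name
``example_active_sessions`` and its ``component`` label are hypothetical, and a real
exporter would normally be built with one of the client libraries rather than by hand.

.. code-block:: shell

    # Print one gauge in the Prometheus text exposition format.
    # The metric name and label are made up for illustration.
    render_metrics() {
      cat <<EOF
    # HELP example_active_sessions Number of active sessions (hypothetical metric).
    # TYPE example_active_sessions gauge
    example_active_sessions{component="example"} $1
    EOF
    }

    render_metrics 42

Each sample line has the form ``name{labels} value``; the ``# HELP`` and ``# TYPE``
lines carry optional metadata about the metric.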

A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.
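
As a rough illustration, a ServiceMonitor for a component might look like the
following. All names here (``example-component``, the ``app`` and ``release`` labels,
the ``metrics`` port name) are hypothetical. In particular, the labels a
ServiceMonitor needs in order to be selected depend on how the cluster's Prometheus
is configured.

.. code-block:: shell

    # Write a sample ServiceMonitor manifest to the current directory.
    cat > servicemonitor.yaml <<'EOF'
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: example-component          # hypothetical name
      labels:
        release: example               # assumption: label the local Prometheus selects on
    spec:
      selector:
        matchLabels:
          app: example-component       # must match the labels on the component's Service
      endpoints:
        - port: metrics                # name of the Service port serving the exporter
          path: /metrics
          interval: 30s
    EOF

In a Helm chart this manifest would live under ``templates/``. For a quick
experiment it can instead be applied with ``kubectl apply -f servicemonitor.yaml``
in the component's namespace.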


Working with Grafana Dashboards
--------------------------------

Once the local cluster's Prometheus is collecting a component's
metrics, they can be visualized using Grafana dashboards. The Grafana
instance running on the AMP cluster is able to send queries to the
Prometheus servers running on all Aether clusters. This means that
component metrics can be visualized on the AMP Grafana regardless of
where the component is actually running.

In order to create a new Grafana dashboard or modify an existing one,
first log in to the AMP Grafana using an account with admin privileges.
To add a new dashboard, click the **+** at left. To make a copy of an
existing dashboard for editing, click the **Dashboard Settings** icon
(gear icon) at upper right of the existing dashboard, and then click
the **Save as…** button at left.

Next, add panels to the dashboard. Since Grafana can access
Prometheus on all the clusters in the environment, each cluster is
available as a data source. For example, when adding a panel showing
metrics collected on the ace-menlo cluster, choose ace-menlo as the
data source.

Clicking on the floppy disk icon at top will save the dashboard
*temporarily* (the dashboard is not saved to persistent storage and is
deleted as soon as Grafana is restarted). To save the dashboard
*permanently*, click the **Share Dashboard** icon next to the title
and save its JSON to a file. Then add the file to the
AMP submodule of OnRamp so that it will be deployed by Ansible:

* Change to directory ``aether-onramp/deps/amp/roles/monitor-load/templates/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configMapGenerator``
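
A sketch of the kind of entry the last step adds is shown below. The ConfigMap name
and the dashboard file name are hypothetical, and the structure of the real
``kustomization.yaml`` may differ. If the file already has a ``configMapGenerator``
list, the new item belongs in that list rather than under a second top-level key.

.. code-block:: shell

    # Write the fragment to a scratch file, to be merged into kustomization.yaml by hand.
    cat > dashboard-entry.yaml <<'EOF'
    configMapGenerator:
      - name: grafana-dashboard-example        # hypothetical ConfigMap name
        files:
          - dashboards/example-dashboard.json  # hypothetical dashboard file
    EOF
    cat dashboard-entry.yaml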

Adding Service-specific Alerts
------------------------------

An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address
or Slack channel).

.. note:: This section on alerts is specific to an operational
   instantiation of Aether that is no longer supported. A port of this
   capability to Aether OnRamp (so it is available to anyone
   that wants to operate Aether) is pending.

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if the expression is true for the time indicated, then the alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
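
For instance, a PrometheusRule that raises a *critical* alert when a component's
scrape target has been down for five minutes could be sketched as follows. The
resource name, the ``release`` label, the alert name, and the ``job`` value are all
hypothetical; ``up`` is the metric Prometheus records for each scrape target.

.. code-block:: shell

    # Write a sample PrometheusRule manifest to the current directory.
    cat > prometheusrule.yaml <<'EOF'
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: example-component-rules    # hypothetical name
      labels:
        release: example               # assumption: label the local Prometheus selects on
    spec:
      groups:
        - name: example-component.rules
          rules:
            - alert: ExampleComponentDown
              expr: up{job="example-component"} == 0   # the expression to evaluate
              for: 5m                                  # must hold this long before firing
              labels:
                severity: critical
              annotations:
                summary: "example-component has been unreachable for 5 minutes"
    EOF

The ``expr`` and ``for`` fields correspond to the expression and the time indicated
in the text above; the ``severity`` label is what the Alertmanager routing keys on.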

The Alertmanager is configured to send alerts with *critical* or *warning* severity to email and Slack channels
monitored by Aether Ops staff. To route a specific alert to a different receiver
(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` to define the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

.. code-block:: shell

    $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit the patchset to Gerrit
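
The encoding and Secret-creation steps above can be sketched as follows. This is
only an illustration: the Secret name, namespace, and data key are placeholders and
should be replaced with the values the existing SealedSecret uses. The stand-in
configuration is written only if no ``alertmanager-config.yaml`` is present, and
its ``matchers`` syntax assumes a reasonably recent Alertmanager (older releases
use ``match``).

.. code-block:: shell

    # Stand-in config so the sketch runs on its own; an existing file is left untouched.
    [ -f alertmanager-config.yaml ] || cat > alertmanager-config.yaml <<'EOF'
    route:
      receiver: default
      routes:
        - matchers:
            - alertname = "ExampleComponentDown"   # hypothetical alert name
          receiver: example-slack
    receivers:
      - name: default
      - name: example-slack
        slack_configs:
          - channel: '#example-component'          # hypothetical channel
            api_url: https://hooks.slack.com/services/REPLACE/ME
    EOF

    # Base64-encode the config as a single line (no wrapping).
    ENCODED="$(base64 < alertmanager-config.yaml | tr -d '\n')"

    # Define the Secret that kubeseal will read.
    cat > alertmanager-config-secret.yaml <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: example-alertmanager-config
      namespace: example-namespace
    type: Opaque
    data:
      alertmanager.yaml: ${ENCODED}
    EOF

The resulting ``alertmanager-config-secret.yaml`` is the input to the ``kubeseal``
command shown above. Only the sealed output should be committed, not the plain
Secret.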

Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the *sealed-secrets-controller* pod running on the cluster in the *kube-system* namespace.