..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alert Development
================================

Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether operators of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
collected and graphed, and can trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------

An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
into Prometheus format and export them.
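
To give a sense of the text format, the following sketch prints a single gauge the way an exporter
would serve it. The metric name ``example_active_sessions`` and its ``component`` label are hypothetical
and not taken from any Aether component; a real exporter would normally use a client library rather than
hand-written output.

.. code-block:: shell

   # Print one gauge sample in the Prometheus text exposition format.
   # The metric name and label are illustrative assumptions.
   emit_metrics() {
       sessions="$1"
       printf '# HELP example_active_sessions Number of active sessions.\n'
       printf '# TYPE example_active_sessions gauge\n'
       printf 'example_active_sessions{component="example"} %s\n' "$sessions"
   }

   emit_metrics 42

Each sample line carries the metric name, optional labels in braces, and the current value; the ``HELP``
and ``TYPE`` comment lines describe the metric to Prometheus.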

A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.
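
As a rough illustration, the commands below write a minimal ServiceMonitor manifest to a file. The resource
name, the labels, and the ``metrics`` port name are all hypothetical placeholders; the labels in particular
have to match whatever selector the cluster's Prometheus is configured with, which this sketch does not know.

.. code-block:: shell

   # Write a minimal ServiceMonitor manifest.
   # All names, labels, and the port name are hypothetical placeholders.
   cat > example-servicemonitor.yaml <<'EOF'
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: example-component
     labels:
       release: example-monitoring    # assumed; must match Prometheus's ServiceMonitor selector
   spec:
     selector:
       matchLabels:
         app: example-component       # assumed label on the component's Service
     endpoints:
       - port: metrics                # assumed name of the Service port serving metrics
         path: /metrics
         interval: 30s
   EOF

In a real deployment the same YAML would more likely live as a template in the component's Helm chart than
be written by hand.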


Working with Grafana Dashboards
-------------------------------

Once the local cluster's Prometheus is collecting a component's
metrics, they can be visualized using Grafana dashboards. The Grafana
instance running on the AMP cluster is able to send queries to the
Prometheus servers running on all Aether clusters. This means that
component metrics can be visualized on the AMP Grafana regardless of
where the component is actually running.

To create a new Grafana dashboard or modify an existing one,
first log in to the AMP Grafana using an account with admin privileges.
To add a new dashboard, click the **+** at left. To make a copy of an
existing dashboard for editing, click the **Dashboard Settings** icon
(gear icon) at the upper right of the existing dashboard, and then click
the **Save as…** button at left.

Next, add panels to the dashboard. Since Grafana can access
Prometheus on all the clusters in the environment, each cluster is
available as a data source. For example, when adding a panel showing
metrics collected on the ace-menlo cluster, choose ace-menlo as the
data source.

Clicking the floppy disk icon at the top saves the dashboard
*temporarily* (the dashboard is not saved to persistent storage and is
deleted as soon as Grafana is restarted). To save the dashboard
*permanently*, click the **Share Dashboard** icon next to the title
and save its JSON to a file. Then add the file to the
AMP submodule of OnRamp so that it will be deployed by Ansible:

* Change to directory ``aether-onramp/deps/amp/roles/monitor-load/templates/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configMapGenerator``
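
The last step above might look roughly like the following. This sketch writes to a separate, hypothetical
file (``kustomization.example.yaml``) so that it cannot overwrite the real ``kustomization.yaml``; the
ConfigMap name and the dashboard file name are assumptions, and the actual file in the repository may
group its dashboards differently.

.. code-block:: shell

   # Sketch of a configMapGenerator entry that packages one dashboard.
   # The ConfigMap name and dashboard file name are hypothetical.
   cat > kustomization.example.yaml <<'EOF'
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   configMapGenerator:
     - name: example-dashboards
       files:
         - dashboards/example-component.json
   EOF

When editing the real file, the new JSON file would be appended to the ``files`` list of the existing
generator entry rather than replacing it.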

Adding Service-specific Alerts
------------------------------

An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address
or Slack channel).

.. note:: This section on alerts is specific to an operational
   instantiation of Aether that is no longer supported. A port of this
   capability to Aether OnRamp (so it is available to anyone
   that wants to operate Aether) is pending.

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if the expression is true for the time indicated, then the alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
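
A minimal sketch of such a resource is shown below. The rule name, the ``example_errors_total`` metric,
the threshold, and the duration are all hypothetical values chosen for illustration; only the overall
shape (an expression, a ``for`` duration, and a ``severity`` label) reflects the description above.

.. code-block:: shell

   # Write a minimal PrometheusRule manifest with a single alerting rule.
   # The metric, threshold, and names are illustrative assumptions.
   cat > example-prometheusrule.yaml <<'EOF'
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: example-component-rules
   spec:
     groups:
       - name: example-component.rules
         rules:
           - alert: ExampleComponentHighErrorRate
             # Fires when the expression stays true for the whole "for" window.
             expr: rate(example_errors_total[5m]) > 0.1
             for: 10m
             labels:
               severity: warning
             annotations:
               summary: Error rate has exceeded 0.1 errors per second for 10 minutes
   EOF

The ``severity`` label is what a routing configuration would typically match on when deciding which
receiver gets the alert.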

The Alertmanager is configured to send alerts with *critical* or *warning* severity to email and Slack channels
monitored by Aether operations staff. To route a specific alert to a different receiver
(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` to define the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

.. code-block:: shell

   $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit the patchset to Gerrit
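
The encoding and Secret-creation steps in the list above can be sketched as a small shell function.
The Secret name, the namespace, and the ``alertmanager.yaml`` data key are assumptions made for this
example; the values actually expected by the cluster's Alertmanager deployment should be used instead.

.. code-block:: shell

   # Wrap an Alertmanager configuration file in a Secret manifest.
   # The Secret name, namespace, and data key below are assumptions.
   make_secret() {
       config_file="$1"
       secret_file="$2"
       # Base64-encode onto a single line (tr removes any line wrapping).
       encoded="$(base64 < "$config_file" | tr -d '\n')"
       cat > "$secret_file" <<EOF
   apiVersion: v1
   kind: Secret
   metadata:
     name: alertmanager-config
     namespace: example-monitoring
   type: Opaque
   data:
     alertmanager.yaml: ${encoded}
   EOF
   }

   # Usage: make_secret alertmanager-config.yaml alertmanager-config-secret.yaml

The resulting file is the input that ``kubeseal`` reads in the command shown in the list.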

Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the *sealed-secrets-controller* pod running on the cluster in the *kube-system* namespace.