..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alert Infrastructure
===================================

Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether OPs staff of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
collected and graphed, and can trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------
An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
to Prometheus format and export them.
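
To illustrate the text exposition format that an exporter serves, the following is a minimal
sketch using only the Python standard library. The metric name ``aether_requests_total`` and
port ``9100`` are hypothetical; a real component would normally use one of the official client
libraries rather than rendering the format by hand.

.. code-block:: python

   # Minimal sketch of a Prometheus exporter (hypothetical metric name).
   from http.server import BaseHTTPRequestHandler, HTTPServer

   REQUEST_COUNT = 0  # the value the component wants to expose


   def render_metrics() -> str:
       """Render current metric values in the Prometheus text exposition format."""
       return (
           "# HELP aether_requests_total Total requests handled by the component.\n"
           "# TYPE aether_requests_total counter\n"
           f"aether_requests_total {REQUEST_COUNT}\n"
       )


   class MetricsHandler(BaseHTTPRequestHandler):
       def do_GET(self):
           if self.path == "/metrics":
               body = render_metrics().encode()
               self.send_response(200)
               self.send_header("Content-Type", "text/plain; version=0.0.4")
               self.send_header("Content-Length", str(len(body)))
               self.end_headers()
               self.wfile.write(body)
           else:
               self.send_response(404)
               self.end_headers()


   if __name__ == "__main__":
       # Prometheus would be configured to scrape http://<host>:9100/metrics
       HTTPServer(("", 9100), MetricsHandler).serve_forever()

A client library hides these details behind counter and gauge objects, but the wire format is
exactly this: optional ``# HELP`` and ``# TYPE`` comments followed by ``name value`` lines.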

A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.
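
A ServiceMonitor might look like the following sketch; the component name, label selector, and
port name are placeholders, and the selector must match the labels on the component's Service:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: my-component
   spec:
     selector:
       matchLabels:
         app: my-component        # must match the Service's labels
     endpoints:
       - port: metrics            # named port on the Service serving /metrics
         interval: 30s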


Working with Grafana Dashboards
--------------------------------
Once the local cluster's Prometheus is collecting a component's metrics, they can be visualized using Grafana
dashboards. The Grafana instance running on the AMP cluster is able to send queries to the Prometheus
servers running on all Aether clusters. This means that component metrics can be visualized on the AMP Grafana
regardless of where the component is actually running.

To create a new Grafana dashboard or modify an existing one, first log in to the AMP Grafana using an account
with admin privileges. To add a new dashboard, click the **+** at left. To make a copy of an existing dashboard for
editing, click the **Dashboard Settings** icon (gear icon) at the upper right of the existing dashboard, and then
click the **Save as…** button at left.

Next, add panels to the dashboard. Since Grafana can access Prometheus on all the clusters in the environment,
each cluster is available as a data source. For example, when adding a panel showing metrics collected on the
ace-menlo cluster, choose ace-menlo as the data source.

Clicking the floppy disk icon at top will save the dashboard *temporarily* (the dashboard is not
saved to persistent storage and is deleted as soon as Grafana is restarted). To save the dashboard *permanently*,
click the **Share Dashboard** icon next to the title and save its JSON to a file. Then add the file to the
aether-app-configs repository so that it will be deployed by Fleet:

* Change to the directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<amp-cluster>/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configmapGenerator``
* Commit the changes and submit a patchset to Gerrit
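
The ``configmapGenerator`` entry added in the third step might look like the following sketch;
the dashboard file name is a placeholder, and the exact labels expected on the generated
ConfigMap depend on how the Grafana sidecar is configured in this deployment:

.. code-block:: yaml

   configMapGenerator:
     - name: my-component-dashboard
       files:
         - dashboards/my-component-dashboard.json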

Once the patchset is merged, the AMP Grafana will automatically detect and deploy the new dashboard.

Adding Service-specific Alerts
------------------------------
An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address or Slack channel).

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if an expression is true for the time indicated, the corresponding alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
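
A PrometheusRule might look like the following sketch; the rule name, metric, threshold, and
durations are all hypothetical and would be chosen to suit the component:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-component-rules
   spec:
     groups:
       - name: my-component
         rules:
           - alert: MyComponentHighErrorRate
             # Fires if the error rate stays above 0.1/s for 10 minutes
             expr: rate(aether_request_errors_total[5m]) > 0.1
             for: 10m
             labels:
               severity: warning
             annotations:
               summary: "Sustained high error rate in my-component"

The ``for`` clause is what implements "true for the time indicated": the expression must hold
continuously for that duration before the alert transitions from pending to firing.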

The Alertmanager is configured to send alerts with *critical* or *warning* severity to e-mail and Slack channels
monitored by Aether OPs staff. If it is desirable to route a specific alert to a different receiver
(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to the directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` that defines the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

  .. code-block:: shell

     $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit a patchset to Gerrit
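
The ``receivers`` and ``route`` changes made in the second step might look like the following
sketch, which routes one alert to a component-specific Slack channel; the receiver name,
webhook URL, channel, and alert name are placeholders:

.. code-block:: yaml

   receivers:
     - name: my-component-slack
       slack_configs:
         - api_url: https://hooks.slack.com/services/...   # placeholder webhook
           channel: '#my-component-alerts'
   route:
     routes:
       - match:
           alertname: MyComponentHighErrorRate
         receiver: my-component-slack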

Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the *sealed-secrets-controller* pod running in the *kube-system* namespace on the cluster.