..
   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
   SPDX-License-Identifier: Apache-2.0

Monitoring and Alerts
=====================

Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
notify Aether OPs staff of events requiring attention. This monitoring stack runs on each Aether cluster.
This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
collected and graphed, and can trigger alerts.


Exporting Service Metrics to Prometheus
---------------------------------------
An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
to expose its metrics to Prometheus. An exporter provides the current values of a component's
metrics via HTTP using a simple text format. Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
in its Time Series Database (TSDB) for querying and analysis. Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
are available for instrumenting code to export metrics in Prometheus format. If a component's metrics are available
in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
into Prometheus format and export them.
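
To illustrate the text exposition format, here is a minimal exporter sketch using only the Python
standard library; the metric name ``aether_requests_total`` and port 9100 are invented for the example,
not taken from any Aether component:

.. code-block:: python

   from http.server import BaseHTTPRequestHandler, HTTPServer

   REQUEST_COUNT = 0  # stand-in for a real component metric


   def render_metrics() -> str:
       """Render current metric values in the Prometheus text format."""
       return (
           "# HELP aether_requests_total Total requests handled.\n"
           "# TYPE aether_requests_total counter\n"
           f"aether_requests_total {REQUEST_COUNT}\n"
       )


   class MetricsHandler(BaseHTTPRequestHandler):
       """Serve the metrics text on the /metrics endpoint."""

       def do_GET(self):
           if self.path != "/metrics":
               self.send_error(404)
               return
           body = render_metrics().encode()
           self.send_response(200)
           self.send_header("Content-Type", "text/plain; version=0.0.4")
           self.send_header("Content-Length", str(len(body)))
           self.end_headers()
           self.wfile.write(body)


   # To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()

In practice, one of the client libraries above maintains the metric values and renders this
format for you; the sketch only shows what Prometheus sees when it scrapes the endpoint.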

A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
this endpoint by defining a
`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
custom resource. The ServiceMonitor is typically created by the Helm chart that installs the component.
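
A ServiceMonitor might look like the following sketch; the names and labels are illustrative, and
the ``port`` must match a named port on the component's Service:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: my-component
   spec:
     selector:
       matchLabels:
         app: my-component    # must match the Service's labels
     endpoints:
       - port: metrics        # named port on the Service
         interval: 30s        # how often Prometheus scrapes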


Working with Grafana Dashboards
--------------------------------
Once the local cluster's Prometheus is collecting a component's metrics, they can be visualized using Grafana
dashboards. The Grafana instance running on the AMP cluster is able to send queries to the Prometheus
servers running on all Aether clusters. This means that component metrics can be visualized on the AMP Grafana
regardless of where the component is actually running.

In order to create a new Grafana dashboard or modify an existing one, first log in to the AMP Grafana using an account
with admin privileges. To add a new dashboard, click the **+** at left. To make a copy of an existing dashboard for
editing, click the **Dashboard Settings** icon (gear icon) at the upper right of the existing dashboard, and then
click the **Save as…** button at left.

Next, add panels to the dashboard. Since Grafana can access Prometheus on all the clusters in the environment,
each cluster is available as a data source. For example, when adding a panel showing metrics collected on the
ace-menlo cluster, choose ace-menlo as the data source.

Clicking on the floppy disk icon at top will save the dashboard *temporarily* (the dashboard is not
saved to persistent storage and is deleted as soon as Grafana is restarted). To save the dashboard *permanently*,
click the **Share Dashboard** icon next to the title and save its JSON to a file. Then add the file to the
aether-app-configs repository so that it will be deployed by Fleet:

* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<amp-cluster>/``
* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configmapGenerator``
* Commit the changes and submit a patchset to Gerrit
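
The ``configmapGenerator`` entry for the steps above might look like the following sketch; the
file and ConfigMap names are illustrative, and the ``grafana_dashboard`` label is the common
convention for Grafana's dashboard-discovery sidecar, which is assumed here:

.. code-block:: yaml

   configmapGenerator:
     - name: my-component-dashboard
       files:
         - dashboards/my-component-dashboard.json
       options:
         labels:
           grafana_dashboard: "1"   # lets Grafana discover the dashboard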

Once the patchset is merged, the AMP Grafana will automatically detect and deploy the new dashboard.

Adding Service-specific Alerts
------------------------------
An alert can be triggered in Prometheus when a component metric crosses a threshold. The Alertmanager
then routes the alert to one or more receivers (e.g., an email address or Slack channel).

To add an alert for a component, create a
`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
custom resource, for example in the Helm chart that deploys the component. This resource describes one or
more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
if the expression is true for the time indicated, then the alert is raised. Once the PrometheusRule
resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
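
A PrometheusRule might look like the following sketch; the metric name, threshold, and durations
are invented for the example:

.. code-block:: yaml

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-component-rules
   spec:
     groups:
       - name: my-component
         rules:
           - alert: MyComponentHighErrorRate
             # raise the alert if the expression holds for 10 minutes
             expr: rate(my_component_errors_total[5m]) > 0.1
             for: 10m
             labels:
               severity: warning
             annotations:
               summary: "my-component error rate is elevated"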

The Alertmanager is configured to send alerts with *critical* or *warning* severity to e-mail and Slack channels
monitored by Aether OPs staff. If it is desirable to route a specific alert to a different receiver
(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
To update the configuration:

* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
* Encode the ``alertmanager-config.yaml`` file as a Base64 string
* Create a file ``alertmanager-config-secret.yaml`` to define the Secret resource using the Base64-encoded string
* Run the following command using a valid ``PUBLICKEY``:

.. code-block:: shell

   $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml

* Commit the changes and submit a patchset to Gerrit
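
The encoding and Secret-wrapping steps above can be sketched as follows; the placeholder config
content and the Secret's ``metadata`` values are assumptions for illustration, not Aether's actual
values, and ``base64 -w 0`` is the GNU coreutils flag for single-line output:

.. code-block:: shell

   # Placeholder Alertmanager config; the real receivers/route come from the repo
   cat > alertmanager-config.yaml <<'EOF'
   route:
     receiver: aether-ops
   receivers:
     - name: aether-ops
   EOF

   # Base64-encode the config as a single line
   B64=$(base64 -w 0 < alertmanager-config.yaml)

   # Wrap the encoded config in a Secret manifest (metadata values are illustrative)
   cat > alertmanager-config-secret.yaml <<EOF
   apiVersion: v1
   kind: Secret
   metadata:
     name: alertmanager-config
     namespace: cattle-monitoring-system
   data:
     alertmanager.yaml: ${B64}
   EOF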

Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
by looking at the logs of the *sealed-secrets-controller* pod running on the cluster in the *kube-system* namespace.