Add Monitoring section to AMP

Change-Id: I166ceafe755612898d225e5fbbe4e279f222851a
diff --git a/amp/monitoring.rst b/amp/monitoring.rst
new file mode 100644
index 0000000..b9d2ff4
--- /dev/null
+++ b/amp/monitoring.rst
@@ -0,0 +1,91 @@
+..
+   SPDX-FileCopyrightText: © 2021 Open Networking Foundation <support@opennetworking.org>
+   SPDX-License-Identifier: Apache-2.0
+
+Monitoring and Alerts
+=====================
+
+Aether leverages `Prometheus <https://prometheus.io/docs/introduction/overview/>`_ to collect
+and store platform and service metrics, `Grafana <https://grafana.com/docs/grafana/latest/getting-started/>`_
+to visualize metrics over time, and `Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>`_ to
+notify Aether OPs staff of events requiring attention.  This monitoring stack runs on each Aether cluster.
+This section describes how an Aether component can "opt in" to the Aether monitoring stack so that its metrics can be
+collected and graphed, and can trigger alerts.
+
+
+Exporting Service Metrics to Prometheus
+---------------------------------------
+An Aether component implements a `Prometheus exporter <https://prometheus.io/docs/instrumenting/writing_exporters/>`_
+to expose its metrics to Prometheus.  An exporter provides the current values of a component's
+metrics via HTTP using a simple text format.  Prometheus scrapes the exporter's HTTP endpoint and stores the metrics
+in its Time Series Database (TSDB) for querying and analysis.  Many `client libraries <https://prometheus.io/docs/instrumenting/clientlibs/>`_
+are available for instrumenting code to export metrics in Prometheus format.  If a component's metrics are available
+in some other format, tools like `Telegraf <https://docs.influxdata.com/telegraf>`_ can be used to convert the metrics
+into Prometheus format and export them.
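+
+For reference, the exposition format an exporter serves over HTTP is plain text; the metric name
+and label below are hypothetical:
+
+.. code-block:: text
+
+   # HELP aether_connected_ues Number of currently connected UEs
+   # TYPE aether_connected_ues gauge
+   aether_connected_ues{site="ace-menlo"} 42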
+
+A component that exposes a Prometheus exporter HTTP endpoint via a Service can tell Prometheus to scrape
+this endpoint by defining a
+`ServiceMonitor <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md>`_
+custom resource.  The ServiceMonitor is typically created by the Helm chart that installs the component.
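+
+A minimal ServiceMonitor might look like the following sketch; the resource name, label, and port
+are hypothetical and would need to match the component's Service definition:
+
+.. code-block:: yaml
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: ServiceMonitor
+   metadata:
+     name: my-component          # hypothetical name
+   spec:
+     selector:
+       matchLabels:
+         app: my-component       # must match the labels on the component's Service
+     endpoints:
+       - port: metrics           # named port on the Service that exposes the exporter
+         interval: 30s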
+
+
+Working with Grafana Dashboards
+--------------------------------
+Once the local cluster's Prometheus is collecting a component's metrics, they can be visualized using Grafana
+dashboards.  The Grafana instance running on the AMP cluster is able to send queries to the Prometheus
+servers running on all Aether clusters.  This means that component metrics can be visualized on the AMP Grafana
+regardless of where the component is actually running.
+
+In order to create a new Grafana dashboard or modify an existing one, first log in to the AMP Grafana using an account
+with admin privileges.  To add a new dashboard, click the **+** at left.  To make a copy of an existing dashboard for
+editing, click the **Dashboard Settings** icon (gear icon) at upper right of the existing dashboard, and then
+click the **Save as…** button at left.
+
+Next, add panels to the dashboard.  Since Grafana can access Prometheus on all the clusters in the environment,
+each cluster is available as a data source.  For example, when adding a panel showing metrics collected on the
+ace-menlo cluster, choose ace-menlo as the data source.
+
+Clicking on the floppy disk icon at top will save the dashboard *temporarily* (the dashboard is not
+saved to persistent storage and is deleted as soon as Grafana is restarted).  To save the dashboard *permanently*,
+click the **Share Dashboard** icon next to the title and save its JSON to a file.  Then add the file to the
+aether-app-configs repository so that it will be deployed by Fleet:
+
+* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<amp-cluster>/``
+* Copy the dashboard JSON file to the ``dashboards/`` sub-directory
+* Edit ``kustomization.yaml`` and add the new dashboard JSON under ``configmapGenerator``
+* Commit the changes and submit a patchset to Gerrit
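+
+As an illustration, the ``configmapGenerator`` entry in ``kustomization.yaml`` might look like the
+following (the generator name and dashboard file name are hypothetical); depending on how the
+Grafana sidecar discovers dashboards, a label such as ``grafana_dashboard`` may also be required:
+
+.. code-block:: yaml
+
+   configMapGenerator:
+     - name: my-component-dashboard       # hypothetical name
+       files:
+         - dashboards/my-dashboard.json   # hypothetical dashboard file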
+
+Once the patchset is merged, the AMP Grafana will automatically detect and deploy the new dashboard.
+
+Adding Service-specific Alerts
+------------------------------
+An alert can be triggered in Prometheus when a component metric crosses a threshold.  The Alertmanager
+then routes the alert to one or more receivers (e.g., an email address or Slack channel).
+
+To add an alert for a component, create a
+`PrometheusRule <https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md>`_
+custom resource, for example in the Helm chart that deploys the component.  This resource describes one or
+more `rules <https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/>`_ using Prometheus expressions;
+if the expression is true for the time indicated, then the alert is raised. Once the PrometheusRule
+resource is instantiated, the cluster's Prometheus will pick up the rule and start evaluating it.
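+
+A sketch of a PrometheusRule is shown below; the alert name, metric, threshold, and annotation
+text are hypothetical:
+
+.. code-block:: yaml
+
+   apiVersion: monitoring.coreos.com/v1
+   kind: PrometheusRule
+   metadata:
+     name: my-component-rules             # hypothetical name
+   spec:
+     groups:
+       - name: my-component
+         rules:
+           - alert: MyComponentDown       # hypothetical alert
+             expr: up{job="my-component"} == 0
+             for: 5m
+             labels:
+               severity: critical
+             annotations:
+               summary: "my-component has been unreachable for 5 minutes"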
+
+The Alertmanager is configured to send alerts with *critical* or *warning* severity to e-mail and Slack channels
+monitored by Aether OPs staff.  If it is desirable to route a specific alert to a different receiver
+(e.g., a component-specific Slack channel), it is necessary to change the Alertmanager configuration. This is stored in
+a `SealedSecret <https://github.com/bitnami-labs/sealed-secrets>`_ custom resource in the aether-app-configs repository.
+To update the configuration:
+
+* Change to directory ``aether-app-configs/infrastructure/rancher-monitoring/overlays/<cluster>/``
+* Update the ``receivers`` and ``route`` sections of the ``alertmanager-config.yaml`` file
+* Encode the ``alertmanager-config.yaml`` file as a Base64 string
+* Create a file ``alertmanager-config-secret.yaml`` to define the Secret resource using the Base64-encoded string
+* Run the following command using a valid ``PUBLICKEY``:
+
+.. code-block:: shell
+
+   $ kubeseal --cert "${PUBLICKEY}" --scope cluster-wide --format yaml < alertmanager-config-secret.yaml > alertmanager-config-sealed-secret.yaml
+
+* Commit the changes and submit a patchset to Gerrit
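+
+As an illustration, a receiver and route added to ``alertmanager-config.yaml`` might look like
+the following sketch (the receiver name, channel, and matcher value are hypothetical); the file
+can then be Base64-encoded with, e.g., ``base64 -w 0 alertmanager-config.yaml`` on Linux:
+
+.. code-block:: yaml
+
+   receivers:
+     - name: my-component-slack           # hypothetical receiver
+       slack_configs:
+         - channel: '#my-component-alerts'
+   route:
+     routes:
+       - match:
+           alertname: MyComponentDown     # hypothetical alert name
+         receiver: my-component-slack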
+
+Once the patchset is merged, verify that the SealedSecret was successfully unsealed and converted to a Secret
+by looking at the logs of the *sealed-secrets-controller* pod running on the cluster in the *kube-system* namespace.
diff --git a/index.rst b/index.rst
index a0d2324..cf28a3f 100644
--- a/index.rst
+++ b/index.rst
@@ -41,6 +41,7 @@
    :glob:
 
    amp/roc
+   amp/monitoring
 
 .. toctree::
    :maxdepth: 3