# Copyright 2020-present Open Networking Foundation
# SPDX-License-Identifier: LicenseRef-ONF-Member-Only-1.0

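# The template below reads two values: alerts.enabled (renders the whole
# PrometheusRule) and alerts.manyEdgeConnectTestsFailing (adds the fleet-wide
# alert). A minimal values sketch, assuming both are plain booleans; the
# settings shown are illustrative, not the chart's documented defaults:
#
#   alerts:
#     enabled: true
#     manyEdgeConnectTestsFailing: true
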
{{- if .Values.alerts.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "edge-monitoring-server.fullname" . }}
  labels:
    {{- include "edge-monitoring-server.labels" . | nindent 4 }}
spec:
  groups:
  - name: ace-e2e-tests-v2.rules
    rules:
    - alert: ScheduledDowntime
      annotations:
        message: The cluster {{`{{ .Labels.name }}`}} is undergoing scheduled maintenance.
      expr: aetheredge_in_maintenance_window{endpoint="metrics80"} > 0
      for: 1m
      labels:
        severity: info
    - alert: SingleEdgeTestNotReporting
      annotations:
        message: |
          The E2E test on cluster {{`{{ .Labels.name }}`}} has not reported results for at least 5 minutes.
      expr: (time() - aetheredge_last_update{endpoint="metrics80"}) > 300
      for: 1m
      labels:
        severity: critical
    - alert: SingleEdgeConnectTestFailing
      annotations:
        message: |
          The E2E test on cluster {{`{{ .Labels.name }}`}} has been reporting UE connect failures for at least 10 minutes.
      expr: aetheredge_connect_test_ok{endpoint="metrics80"} < 1
      for: 10m
      labels:
        severity: critical
    - alert: SingleEdgePingTestFailing
      annotations:
        message: |
          The E2E test on cluster {{`{{ .Labels.name }}`}} has been reporting that the UE cannot ping the Internet for at least 10 minutes.
      expr: aetheredge_ping_test_ok{endpoint="metrics80"} < 1
      for: 11m
      labels:
        severity: critical
    {{- if .Values.alerts.manyEdgeConnectTestsFailing }}
    # Assuming both metrics are 0 or 1: a cluster counts as healthy when its
    # connect test passes or it is in a maintenance window (clamp_max caps the
    # sum at 1). The alert fires once the average across clusters has stayed
    # below 0.5 for 5 minutes.
    - alert: ManyEdgeConnectTestsFailing
      annotations:
        message: |
          Over half of the clusters are reporting UE connect failures.
      expr: avg(clamp_max(aetheredge_connect_test_ok{endpoint="metrics80"} + aetheredge_in_maintenance_window{endpoint="metrics80"}, 1)) < 0.5
      for: 5m
      labels:
        severity: critical
    {{- end }}
{{- end }}