#########################
Monitoring and Operations
#########################

There is a Grafana deployment with a few dashboards that are created by
default, a Prometheus deployment that collects metrics from the cluster and
sends alerts to AlertManager, and a Loki deployment that collects logs from
the cluster using Vector.

******************************
Philosophy and Alerting Levels
******************************

Atmosphere's monitoring philosophy is strongly aligned with the principles
outlined in the Google Site Reliability Engineering (SRE) book. Our approach
focuses on alerting on conditions that are symptomatic of issues which directly
impact the service or system health, rather than simply monitoring the state of
individual components.

Alerting Philosophy
===================

Our alerting philosophy aims to alert the right people at the right time. Most
alerts, if they affect a single system, trigger a lower priority level (P4 or
P5). However, if an issue affects the entire control plane of a specific
service, it might escalate to a P3 or P2, and if the whole service is
unavailable, it becomes a P1.

We believe in minimizing alert noise to ensure that alerts are meaningful and
actionable. Our goal is to have every alert provide enough information to
initiate an immediate and effective response; for high priority alerts, this
applies regardless of business hours.

We continue to refine our monitoring and alerting strategies to ensure that we
are effectively identifying and responding to incidents. The ultimate goal is
to provide a reliable and high-quality service to all our users.

Severity Levels
===============

Our alerting system classifies incidents into different severity levels based
on their impact on the system and users.

**P1**: Critical
    This level is used for incidents causing a complete service disruption or
    significant loss of functionality across the entire Atmosphere platform.
    Immediate response, attention, and action are necessary regardless of
    business hours.

**P2**: High
    This level is for incidents that affect a large group of users or critical
    system components. These incidents require swift attention and action,
    regardless of business hours, but do not cause a total disruption.

**P3**: Moderate
    This level is for incidents that affect a smaller group of users or a single
    system. These incidents require attention and may necessitate action during
    business hours.

**P4**: Low
    This level is used for minor issues that have a limited impact on a small
    subset of users or system functionality. These incidents require attention
    and action, if necessary, during standard business hours.

**P5**: Informational
    This is the lowest level of severity, used for providing information about
    normal system activities or minor issues that don't significantly impact
    users or system functionality. These incidents typically do not require
    immediate attention or action and are addressed during standard business
    hours.
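
In practice, the ``severity`` label attached to an alerting rule is what drives
these priority levels. The following is a minimal, hypothetical
``PrometheusRule`` sketch (the alert name, metric, and threshold are
illustrative and not shipped by Atmosphere) showing a symptom-based alert
labeled ``severity: critical``, which the OpsGenie integration example later in
this document maps to a P1:

.. code-block:: yaml

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: example-symptom-rules   # hypothetical rule name
      namespace: monitoring
    spec:
      groups:
        - name: example.rules
          rules:
            - alert: APIHighErrorRate   # hypothetical alert name
              # Symptom-based: alert on the user-visible error ratio rather
              # than on the state of any single component.
              expr: |
                sum(rate(http_requests_total{code=~"5.."}[5m]))
                  /
                sum(rate(http_requests_total[5m])) > 0.05
              for: 15m
              labels:
                severity: critical
              annotations:
                message: More than 5% of API requests are failing.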

**********************
Operational Procedures
**********************

Creating silences
=================

In order to create a silence, you'll need to log in as an admin user to the
Grafana instance that is deployed as part of Atmosphere.

1. Click on the hamburger menu in the top left corner and select "Alerting"
   and then "Silences" from the menu.

   .. image:: images/monitoring-silences-menu.png
      :alt: Silences menu
      :width: 200

2. Ensure that you select "AlertManager" in the top right corner of the page;
   this makes sure that the silence is created inside the AlertManager that is
   managed by the Prometheus operator instead of the built-in Grafana
   AlertManager, which is not used.

   .. image:: images/monitoring-alertmanger-list.png
      :alt: AlertManager list
      :width: 200

   .. admonition:: AlertManager selection
      :class: warning

      It's important that you select the AlertManager that is managed by the
      Prometheus operator, otherwise your silence will not be applied to the
      Prometheus instance that is deployed as part of Atmosphere.

3. Click the "Add Silence" button and use the AlertManager format to create
   your silence, which you can test by seeing if it matches any alerts in the
   list labeled "Affected alert instances".

.. admonition:: Limit the number of labels
    :class: info

    It is important to limit the number of labels that you use in your silence
    to ensure that it will continue to work even if the alerts are modified.

    For example, if you have an alert that is labeled with the following labels:

    - ``alertname``
    - ``instance``
    - ``job``
    - ``severity``

    You should only use the ``alertname`` and ``severity`` labels in your
    silence to ensure that it will continue to work even if the ``instance``
    or ``job`` labels are modified.
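
For example, a silence that covers the ``NodeNetworkMulticast`` alert described
later in this document only needs two matchers. The sketch below shows the
matchers as YAML for readability (the ``severity`` value is assumed for
illustration); in practice you enter them in the Grafana silence form or submit
them to the AlertManager API:

.. code-block:: yaml

    # Only the stable labels are matched, so the silence keeps working even
    # if the "instance" or "job" labels change.
    matchers:
      - name: alertname
        value: NodeNetworkMulticast
        isRegex: false
      - name: severity
        value: warning          # assumed severity, shown for illustration
        isRegex: false
    comment: "Known multicast burst, investigating"   # free-form description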

**************
Configurations
**************

Dashboard Management
====================

For Grafana, rather than enabling persistence through the application's user
interface or manual Helm chart modifications, dashboards should be managed
directly via the Helm chart values.

.. admonition:: Avoid Manual Persistence Configurations!
    :class: warning

    It is important to avoid manual persistence configurations, especially for
    services like Grafana, where dashboards and data sources can be saved. Such
    practices are not captured in version control and pose a risk of data loss,
    configuration drift, and upgrade complications.

To manage Grafana dashboards through Helm, you can include the dashboard
definitions within your configuration file. By doing so, you facilitate
version-controlled dashboard configurations that can be replicated across
different deployments without manual intervention.

For example, a dashboard can be defined in the Helm values like this:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      grafana:
        dashboards:
          default:
            my-dashboard:
              gnetId: 10000
              revision: 1
              datasource: Prometheus

This instructs Helm to fetch and configure the specified dashboard from
`Grafana.com dashboards <https://grafana.com/grafana/dashboards/>`_, using
Prometheus as the data source. You can find more examples of how to do
this in the Grafana Helm chart `Import Dashboards <https://github.com/grafana/helm-charts/tree/main/charts/grafana#import-dashboards>`_
documentation.
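
If a dashboard is not published on Grafana.com, the Grafana Helm chart also
accepts raw dashboard JSON inline under the same ``dashboards`` key (see the
Import Dashboards documentation linked above). The following is a minimal
sketch, where ``custom-dashboard`` and its JSON body are hypothetical
placeholders for a dashboard exported from the Grafana UI:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      grafana:
        dashboards:
          default:
            custom-dashboard:
              # Paste the exported dashboard JSON here.
              json: |
                {
                  "title": "Custom Dashboard",
                  "panels": []
                }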

************
Viewing data
************

There are a few different ways to view the data that is collected by the
monitoring stack. The most common ways are through AlertManager, Grafana, and
Prometheus.
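
Each of these components is exposed behind its own ``Ingress``, with the host
controlled by an Ansible variable described in the sections below. A minimal
inventory sketch, using hypothetical hostnames:

.. code-block:: yaml

    # Hostnames are illustrative; replace them with your own DNS names.
    kube_prometheus_stack_grafana_host: grafana.cloud.example.com
    kube_prometheus_stack_prometheus_host: prometheus.cloud.example.com
    kube_prometheus_stack_alertmanager_host: alertmanager.cloud.example.com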

Grafana dashboard
=================

By default, an ``Ingress`` is created for Grafana using the
``kube_prometheus_stack_grafana_host`` variable. The authentication is done
using the Keycloak service which is deployed by default.

Inside Keycloak, there are three client roles that are created for Grafana:

``grafana:admin``
    Has access to all organization resources, including dashboards, users, and
    teams.

``grafana:editor``
    Can view and edit dashboards, folders, and playlists.

``grafana:viewer``
    Can view dashboards, playlists, and query data sources.

You can view the existing dashboards by going to *Manage* > *Dashboards*. You
can also check any alerts that are currently firing by going to *Alerting* >
*Alerts*.

Prometheus
==========

By default, Prometheus is exposed behind an ``Ingress`` using the
``kube_prometheus_stack_prometheus_host`` variable. In addition, it is also
running behind the `oauth2-proxy` service which is used for authentication
so that only authenticated users can access the Prometheus UI.

Alternative Authentication
--------------------------

It is possible to bypass the `oauth2-proxy` service and use an alternative
authentication method to access the Prometheus UI. In both cases, we will
be overriding the ``servicePort`` on the ``Ingress`` so that it points to the
port where Prometheus is running rather than to the `oauth2-proxy` service.

.. admonition:: Advanced Usage Only
    :class: warning

    It's strongly recommended that you keep the `oauth2-proxy` service in
    front of the Prometheus UI. The `oauth2-proxy` service is responsible for
    authenticating users and ensuring that only authenticated users can access
    the Prometheus UI.

Basic Authentication
~~~~~~~~~~~~~~~~~~~~

If you want to rely on basic authentication to access the Prometheus UI instead
of using the `oauth2-proxy` service to expose it over single sign-on, you can
do so by making the following changes to your inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      prometheus:
        ingress:
          servicePort: 8080
          annotations:
            nginx.ingress.kubernetes.io/auth-type: basic
            nginx.ingress.kubernetes.io/auth-secret: basic-auth-secret-name

In the example above, we are using the ``basic-auth-secret-name`` secret to
authenticate users. The secret must be created in the same namespace as the
Prometheus deployment and follow the format described in the
`Ingress NGINX Annotations <https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#annotations>`_.
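
For reference, the secret consumed by the ``auth-secret`` annotation is a
regular Kubernetes ``Secret`` with an ``auth`` key containing htpasswd entries.
A minimal sketch, assuming the monitoring stack runs in the ``monitoring``
namespace and using a placeholder htpasswd hash:

.. code-block:: yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: basic-auth-secret-name
      namespace: monitoring
    type: Opaque
    stringData:
      # One "user:hash" line per user, generated with htpasswd; this value
      # is a placeholder, not a real credential.
      auth: "admin:$apr1$examplehash"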

IP Whitelisting
~~~~~~~~~~~~~~~

If you want to whitelist specific IPs to access the Prometheus UI, you can do
so by making the following changes to your inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      prometheus:
        ingress:
          servicePort: 8080
          annotations:
            nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/24,172.10.0.1"

In the example above, we are whitelisting the IP range ``10.0.0.0/24`` and the
IP address ``172.10.0.1``.

AlertManager
============

By default, the AlertManager dashboard is exposed using an ``Ingress`` at the
host defined by the Ansible variable
``kube_prometheus_stack_alertmanager_host``, behind the `oauth2-proxy` service
and protected by Keycloak, similar to Prometheus.

************
Integrations
************

Since Atmosphere relies on AlertManager to send alerts, it is possible to
integrate it with services like OpsGenie, PagerDuty, email and more. To
receive monitoring alerts using your preferred notification tools, you'll
need to integrate them with AlertManager.
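
Each integration below is expressed as an AlertManager receiver. The examples
use a receiver named ``notifier`` together with a ``"null"`` receiver; the
following is a rough sketch of how an AlertManager route could direct alerts to
those receivers (the route shown here is illustrative only, and
``SomeNoisyAlert`` is a hypothetical alert name):

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          route:
            receiver: notifier            # default destination for alerts
            group_by: ["alertname", "severity"]
            routes:
              - receiver: "null"          # discard alerts matched here
                matchers:
                  - alertname="SomeNoisyAlert"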

OpsGenie
========

In order to get started, you will need to complete the following steps inside
OpsGenie:

1. Create an integration inside OpsGenie. You can do this by going to
   *Settings* > *Integrations* > *Add Integration* and selecting *Prometheus*.
2. Copy the API key that is generated for you and set up the correct assignment
   rules inside OpsGenie.
3. Create a new heartbeat inside OpsGenie. You can do this by going to
   *Settings* > *Heartbeats* > *Create Heartbeat*. Set the interval to 1 minute.

Afterwards, you can configure the following options for the Atmosphere config,
making sure that you replace the placeholders with the correct values:

``API_KEY``
    The API key that you copied from the OpsGenie integration.

``HEARTBEAT_NAME``
    The name of the heartbeat that you created inside OpsGenie.

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: "null"
            - name: notifier
              opsgenie_configs:
                - api_key: API_KEY
                  message: >-
                    {% raw -%}
                    {{ .GroupLabels.alertname }}
                    {%- endraw %}
                  priority: >-
                    {% raw -%}
                    {{- if eq .GroupLabels.severity "critical" -}}
                    P1
                    {{- else if eq .GroupLabels.severity "warning" -}}
                    P3
                    {{- else if eq .GroupLabels.severity "info" -}}
                    P5
                    {{- else -}}
                    {{ .GroupLabels.severity }}
                    {{- end -}}
                    {%- endraw %}
                  description: |-
                    {% raw -%}
                    {{ if gt (len .Alerts.Firing) 0 -}}
                    Alerts Firing:
                    {{ range .Alerts.Firing }}
                    - Message: {{ .Annotations.message }}
                      Labels:
                    {{ range .Labels.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Annotations:
                    {{ range .Annotations.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Source: {{ .GeneratorURL }}
                    {{ end }}
                    {{- end }}
                    {{ if gt (len .Alerts.Resolved) 0 -}}
                    Alerts Resolved:
                    {{ range .Alerts.Resolved }}
                    - Message: {{ .Annotations.message }}
                      Labels:
                    {{ range .Labels.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Annotations:
                    {{ range .Annotations.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Source: {{ .GeneratorURL }}
                    {{ end }}
                    {{- end }}
                    {%- endraw %}
            - name: heartbeat
              webhook_configs:
                - url: https://api.opsgenie.com/v2/heartbeats/HEARTBEAT_NAME/ping
                  send_resolved: false
                  http_config:
                    basic_auth:
                      password: API_KEY

Once this is done and deployed, you'll start to see alerts inside OpsGenie and
you can also verify that the heartbeat is listed as *ACTIVE*.

PagerDuty
=========

To integrate with PagerDuty, first you need to prepare an *Integration key*. In
order to do that, you must decide how you want to integrate with PagerDuty since
there are two ways to do it:

**Event Orchestration**
    This method is beneficial if you want to build different routing rules based
    on the events coming from the integrated tool.

**PagerDuty Service Integration**
    This method is beneficial if you don't need to route alerts from the integrated
    tool to different responders based on the event payload.

For both of these methods, you need to create an *Integration key* in PagerDuty
using the `PagerDuty Integration Guide <https://www.pagerduty.com/docs/guides/prometheus-integration-guide/>`_.

Once you're done, you'll need to configure the inventory with the following
options:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: notifier
              pagerduty_configs:
                - service_key: '<your integration key here>'

You can find more details about
`pagerduty_configs <https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config>`_
in the Prometheus documentation.

Email
=====

To integrate with email, you need to configure the following options in the
inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: notifier
              email_configs:
                - smarthost: 'smtp.gmail.com:587'
                  auth_username: '<your email id here>'
                  auth_password: '<your email password here>'
                  from: '<your email id here>'
                  to: "<receiver's email id here>"
                  headers:
                    subject: 'Prometheus Mail Alerts'

You can find more details about
`email_configs <https://prometheus.io/docs/alerting/latest/configuration/#email_configs>`_
in the Prometheus documentation.

****************
Alerts Reference
****************

``etcdDatabaseHighFragmentationRatio``
    This alert is triggered when the etcd database has a high fragmentation
    ratio, which can cause performance issues on the cluster. In order to
    resolve this issue, you can use the following command:

    .. code-block:: console

        kubectl -n kube-system exec svc/kube-prometheus-stack-kube-etcd -- \
            etcdctl defrag \
            --cluster \
            --cacert /etc/kubernetes/pki/etcd/ca.crt \
            --key /etc/kubernetes/pki/etcd/server.key \
            --cert /etc/kubernetes/pki/etcd/server.crt

``NodeNetworkMulticast``
    This alert is triggered when a node is receiving large volumes of multicast
    traffic, which can be a sign of a misconfigured network or a malicious
    actor.

    This can result in high CPU usage on the node and can cause the node to
    become unresponsive. It can also be the cause of a very high number of
    software interrupts on the node.

    In order to find the root cause of this issue, you can use the following
    commands:

    .. code-block:: console

        iftop -ni $DEV -f 'multicast and not broadcast'

    With the command above, you're able to see which IP addresses are sending
    the multicast traffic. Once you have the IP address, you can use the
    following command to find the server behind it:

    .. code-block:: console

        openstack server list --all-projects --long -n --ip $IP

``EtcdMembersDown``
    If any alarms are fired from Prometheus for ``etcd`` issues such as
    ``TargetDown``, ``etcdMembersDown``, or ``etcdInsufficientMembers``, it
    could be due to expired certificates. You can update the certificates that
    ``kube-prometheus-stack`` uses to talk to ``etcd`` with the following
    commands:

    .. code-block:: console

        kubectl -n monitoring delete secret/kube-prometheus-stack-etcd-client-cert
        kubectl -n monitoring create secret generic kube-prometheus-stack-etcd-client-cert \
            --from-file=/etc/kubernetes/pki/etcd/ca.crt \
            --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
            --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key