#########################
Monitoring and Operations
#########################

There is a Grafana deployment with a few dashboards that are created by
default, a Prometheus deployment that collects metrics from the cluster and
sends alerts to AlertManager, and a Loki deployment that collects logs from
the cluster using Vector.

******************************
Philosophy and Alerting Levels
******************************

Atmosphere's monitoring philosophy is strongly aligned with the principles
outlined in the Google Site Reliability Engineering (SRE) book. Our approach
focuses on alerting on conditions that are symptomatic of issues which directly
impact the service or system health, rather than simply monitoring the state of
individual components.

Alerting Philosophy
===================

Our alerting philosophy aims to alert the right people at the right time. Most
alerts, if they affect a single system, trigger a lower priority level (P4 or
P5). However, if an issue affects the entire control plane of a specific
service, it might escalate to a P3 or P2, and if the whole service is
unavailable, it becomes a P1.

We believe in minimizing alert noise to ensure that alerts are meaningful and
actionable. Our goal is to have every alert provide enough information to
initiate an immediate and effective response; for high priority alerts, this
applies regardless of business hours.

We continue to refine our monitoring and alerting strategies to ensure that we
are effectively identifying and responding to incidents. The ultimate goal is
to provide a reliable and high-quality service to all our users.

Severity Levels
===============

Our alerting system classifies incidents into different severity levels based
on their impact on the system and users.

**P1**: Critical
    This level is used for incidents causing a complete service disruption or
    significant loss of functionality across the entire Atmosphere platform.
    Immediate response, attention, and action are necessary regardless of
    business hours.

**P2**: High
    This level is for incidents that affect a large group of users or critical
    system components. These incidents require swift attention and action,
    regardless of business hours, but do not cause a total disruption.

**P3**: Moderate
    This level is for incidents that affect a smaller group of users or a single
    system. These incidents require attention and may necessitate action during
    business hours.

**P4**: Low
    This level is used for minor issues that have a limited impact on a small
    subset of users or system functionality. These incidents require attention
    and action, if necessary, during standard business hours.

**P5**: Informational
    This is the lowest level of severity, used for providing information about
    normal system activities or minor issues that don't significantly impact
    users or system functionality. These incidents typically do not require
    immediate attention or action and are addressed during standard business
    hours.
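
In practice, the ``severity`` label attached to an alerting rule is what drives
these priority levels. The following is a minimal, hypothetical
``PrometheusRule`` sketch (the alert name, metric, and threshold are
illustrative and not shipped by Atmosphere) showing a symptom-based alert
labeled ``severity: critical``, which the OpsGenie integration example later in
this document maps to a P1:

.. code-block:: yaml

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: example-symptom-rules   # hypothetical rule name
      namespace: monitoring
    spec:
      groups:
        - name: example.rules
          rules:
            - alert: APIHighErrorRate   # hypothetical alert name
              # Symptom-based: alert on the user-visible error ratio rather
              # than on the state of any single component.
              expr: |
                sum(rate(http_requests_total{code=~"5.."}[5m]))
                  /
                sum(rate(http_requests_total[5m])) > 0.05
              for: 15m
              labels:
                severity: critical
              annotations:
                message: More than 5% of API requests are failing.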

**********************
Operational Procedures
**********************

Creating silences
=================

In order to create a silence, you'll need to log in as an admin user to the
Grafana instance that is deployed as part of Atmosphere.

1. Click on the hamburger menu in the top left corner and select "Alerting"
   and then "Silences" from the menu.

   .. image:: images/monitoring-silences-menu.png
      :alt: Silences menu
      :width: 200

2. Ensure that you select "AlertManager" in the top right corner of the page;
   this makes sure that the silence is created inside the AlertManager that is
   managed by the Prometheus operator instead of the built-in Grafana
   AlertManager, which is not used.

   .. image:: images/monitoring-alertmanger-list.png
      :alt: AlertManager list
      :width: 200

   .. admonition:: AlertManager selection
      :class: warning

      It's important that you select the AlertManager that is managed by the
      Prometheus operator, otherwise your silence will not be applied to the
      Prometheus instance that is deployed as part of Atmosphere.

3. Click the "Add Silence" button and use the AlertManager format to create
   your silence, which you can test by seeing if it matches any alerts in the
   list labeled "Affected alert instances".

.. admonition:: Limit the number of labels
    :class: info

    It is important to limit the number of labels that you use in your silence
    to ensure that it will continue to work even if the alerts are modified.

    For example, if you have an alert that is labeled with the following labels:

    - ``alertname``
    - ``instance``
    - ``job``
    - ``severity``

    You should only use the ``alertname`` and ``severity`` labels in your
    silence to ensure that it will continue to work even if the ``instance``
    or ``job`` labels are modified.
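
For example, a silence that covers the ``NodeNetworkMulticast`` alert described
later in this document only needs two matchers. The sketch below shows the
matchers as YAML for readability (the ``severity`` value is assumed for
illustration); in practice you enter them in the Grafana silence form or submit
them to the AlertManager API:

.. code-block:: yaml

    # Only the stable labels are matched, so the silence keeps working even
    # if the "instance" or "job" labels change.
    matchers:
      - name: alertname
        value: NodeNetworkMulticast
        isRegex: false
      - name: severity
        value: warning          # assumed severity, shown for illustration
        isRegex: false
    comment: "Known multicast burst, investigating"   # free-form description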

**************
Configurations
**************

Dashboard Management
====================

For Grafana, rather than enabling persistence through the application's user
interface or manual Helm chart modifications, dashboards should be managed
directly via the Helm chart values.

.. admonition:: Avoid Manual Persistence Configurations!
    :class: warning

    It is important to avoid manual persistence configurations, especially for
    services like Grafana, where dashboards and data sources can be saved. Such
    practices are not captured in version control and pose a risk of data loss,
    configuration drift, and upgrade complications.

To manage Grafana dashboards through Helm, you can include the dashboard
definitions within your configuration file. By doing so, you facilitate
version-controlled dashboard configurations that can be replicated across
different deployments without manual intervention.

For example, a dashboard can be defined in the Helm values like this:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      grafana:
        dashboards:
          default:
            my-dashboard:
              gnetId: 10000
              revision: 1
              datasource: Prometheus

This instructs Helm to fetch and configure the specified dashboard from
`Grafana.com dashboards <https://grafana.com/grafana/dashboards/>`_, using
Prometheus as the data source. You can find more examples of how to do
this in the Grafana Helm chart `Import Dashboards <https://github.com/grafana/helm-charts/tree/main/charts/grafana#import-dashboards>`_
documentation.
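
If a dashboard is not published on Grafana.com, the Grafana Helm chart also
accepts raw dashboard JSON inline under the same ``dashboards`` key (see the
Import Dashboards documentation linked above). The following is a minimal
sketch, where ``custom-dashboard`` and its JSON body are hypothetical
placeholders for a dashboard exported from the Grafana UI:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      grafana:
        dashboards:
          default:
            custom-dashboard:
              # Paste the exported dashboard JSON here.
              json: |
                {
                  "title": "Custom Dashboard",
                  "panels": []
                }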

************
Viewing data
************

There are a few different ways to view the data that is collected by the
monitoring stack. The most common ways are through AlertManager, Grafana, and
Prometheus.
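
Each of these components is exposed behind its own ``Ingress``, with the host
controlled by an Ansible variable described in the sections below. A minimal
inventory sketch, using hypothetical hostnames:

.. code-block:: yaml

    # Hostnames are illustrative; replace them with your own DNS names.
    kube_prometheus_stack_grafana_host: grafana.cloud.example.com
    kube_prometheus_stack_prometheus_host: prometheus.cloud.example.com
    kube_prometheus_stack_alertmanager_host: alertmanager.cloud.example.com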

Grafana dashboard
=================

By default, an ``Ingress`` is created for Grafana using the
``kube_prometheus_stack_grafana_host`` variable. The authentication is done
using the Keycloak service which is deployed by default.

Inside Keycloak, there are three client roles that are created for Grafana:

``grafana:admin``
    Has access to all organization resources, including dashboards, users, and
    teams.

``grafana:editor``
    Can view and edit dashboards, folders, and playlists.

``grafana:viewer``
    Can view dashboards, playlists, and query data sources.

You can view the existing dashboards by going to *Manage* > *Dashboards*. You
can also check any alerts that are currently firing by going to *Alerting* >
*Alerts*.

Prometheus
==========

By default, Prometheus is exposed behind an ``Ingress`` using the
``kube_prometheus_stack_prometheus_host`` variable. In addition, it is also
running behind the `oauth2-proxy` service which is used for authentication
so that only authenticated users can access the Prometheus UI.

Alternative Authentication
--------------------------

It is possible to bypass the `oauth2-proxy` service and use an alternative
authentication method to access the Prometheus UI. In both cases, we will
be overriding the ``servicePort`` on the ``Ingress`` so that it points to the
port where Prometheus is running rather than to the `oauth2-proxy` service.

.. admonition:: Advanced Usage Only
    :class: warning

    It's strongly recommended that you keep the `oauth2-proxy` service in
    front of the Prometheus UI. The `oauth2-proxy` service is responsible for
    authenticating users and ensuring that only authenticated users can access
    the Prometheus UI.

Basic Authentication
~~~~~~~~~~~~~~~~~~~~

If you want to rely on basic authentication to access the Prometheus UI instead
of using the `oauth2-proxy` service to expose it over single sign-on, you can
do so by making the following changes to your inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      prometheus:
        ingress:
          servicePort: 8080
          annotations:
            nginx.ingress.kubernetes.io/auth-type: basic
            nginx.ingress.kubernetes.io/auth-secret: basic-auth-secret-name

In the example above, we are using the ``basic-auth-secret-name`` secret to
authenticate users. The secret must be created in the same namespace as the
Prometheus deployment and follow the format described in the
`Ingress NGINX Annotations <https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#annotations>`_.
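
For reference, the secret consumed by the ``auth-secret`` annotation is a
regular Kubernetes ``Secret`` with an ``auth`` key containing htpasswd entries.
A minimal sketch, assuming the monitoring stack runs in the ``monitoring``
namespace and using a placeholder htpasswd hash:

.. code-block:: yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: basic-auth-secret-name
      namespace: monitoring
    type: Opaque
    stringData:
      # One "user:hash" line per user, generated with htpasswd; this value
      # is a placeholder, not a real credential.
      auth: "admin:$apr1$examplehash"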

IP Whitelisting
~~~~~~~~~~~~~~~

If you want to whitelist specific IPs to access the Prometheus UI, you can do
so by making the following changes to your inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      prometheus:
        ingress:
          servicePort: 8080
          annotations:
            nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/24,172.10.0.1"

In the example above, we are whitelisting the IP range ``10.0.0.0/24`` and the
IP address ``172.10.0.1``.

AlertManager
============

By default, the AlertManager dashboard is exposed using an ``Ingress`` at the
host defined by the Ansible variable
``kube_prometheus_stack_alertmanager_host``, behind the `oauth2-proxy` service
and protected by Keycloak, similar to Prometheus.

************
Integrations
************

Since Atmosphere relies on AlertManager to send alerts, it is possible to
integrate it with services like OpsGenie, PagerDuty, email and more. To
receive monitoring alerts using your preferred notification tools, you'll
need to integrate them with AlertManager.
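
Each integration below is expressed as an AlertManager receiver. The examples
use a receiver named ``notifier`` together with a ``"null"`` receiver; the
following is a rough sketch of how an AlertManager route could direct alerts to
those receivers (the route shown here is illustrative only, and
``SomeNoisyAlert`` is a hypothetical alert name):

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          route:
            receiver: notifier            # default destination for alerts
            group_by: ["alertname", "severity"]
            routes:
              - receiver: "null"          # discard alerts matched here
                matchers:
                  - alertname="SomeNoisyAlert"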

OpsGenie
========

In order to get started, you will need to complete the following steps inside
OpsGenie:

1. Create an integration inside OpsGenie. You can do this by going to
   *Settings* > *Integrations* > *Add Integration* and selecting *Prometheus*.
2. Copy the API key that is generated for you and set up the correct assignment
   rules inside OpsGenie.
3. Create a new heartbeat inside OpsGenie. You can do this by going to
   *Settings* > *Heartbeats* > *Create Heartbeat*. Set the interval to 1 minute.

Afterwards, you can configure the following options for the Atmosphere config,
making sure that you replace the placeholders with the correct values:

``API_KEY``
    The API key that you copied from the OpsGenie integration.

``HEARTBEAT_NAME``
    The name of the heartbeat that you created inside OpsGenie.

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: "null"
            - name: notifier
              opsgenie_configs:
                - api_key: API_KEY
                  message: >-
                    {% raw -%}
                    {{ .GroupLabels.alertname }}
                    {%- endraw %}
                  priority: >-
                    {% raw -%}
                    {{- if eq .GroupLabels.severity "critical" -}}
                    P1
                    {{- else if eq .GroupLabels.severity "warning" -}}
                    P3
                    {{- else if eq .GroupLabels.severity "info" -}}
                    P5
                    {{- else -}}
                    {{ .GroupLabels.severity }}
                    {{- end -}}
                    {%- endraw %}
                  description: |-
                    {% raw -%}
                    {{ if gt (len .Alerts.Firing) 0 -}}
                    Alerts Firing:
                    {{ range .Alerts.Firing }}
                    - Message: {{ .Annotations.message }}
                      Labels:
                    {{ range .Labels.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Annotations:
                    {{ range .Annotations.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Source: {{ .GeneratorURL }}
                    {{ end }}
                    {{- end }}
                    {{ if gt (len .Alerts.Resolved) 0 -}}
                    Alerts Resolved:
                    {{ range .Alerts.Resolved }}
                    - Message: {{ .Annotations.message }}
                      Labels:
                    {{ range .Labels.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Annotations:
                    {{ range .Annotations.SortedPairs }}   - {{ .Name }} = {{ .Value }}
                    {{ end }}  Source: {{ .GeneratorURL }}
                    {{ end }}
                    {{- end }}
                    {%- endraw %}
            - name: heartbeat
              webhook_configs:
                - url: https://api.opsgenie.com/v2/heartbeats/HEARTBEAT_NAME/ping
                  send_resolved: false
                  http_config:
                    basic_auth:
                      password: API_KEY

Once this is done and deployed, you'll start to see alerts inside OpsGenie and
you can also verify that the heartbeat is listed as *ACTIVE*.

PagerDuty
=========

To integrate with PagerDuty, first you need to prepare an *Integration key*. In
order to do that, you must decide how you want to integrate with PagerDuty since
there are two ways to do it:

**Event Orchestration**
    This method is beneficial if you want to build different routing rules based
    on the events coming from the integrated tool.

**PagerDuty Service Integration**
    This method is beneficial if you don't need to route alerts from the integrated
    tool to different responders based on the event payload.

For both of these methods, you need to create an *Integration key* in PagerDuty
using the `PagerDuty Integration Guide <https://www.pagerduty.com/docs/guides/prometheus-integration-guide/>`_.

Once you're done, you'll need to configure the inventory with the following
options:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: notifier
              pagerduty_configs:
                - service_key: '<your integration key here>'

You can find more details about
`pagerduty_configs <https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config>`_
in the Prometheus documentation.

Email
=====

To integrate with email, you need to configure the following options in the
inventory:

.. code-block:: yaml

    kube_prometheus_stack_helm_values:
      alertmanager:
        config:
          receivers:
            - name: notifier
              email_configs:
                - smarthost: 'smtp.gmail.com:587'
                  auth_username: '<your email id here>'
                  auth_password: '<your email password here>'
                  from: '<your email id here>'
                  to: "<receiver's email id here>"
                  headers:
                    subject: 'Prometheus Mail Alerts'

You can find more details about
`email_configs <https://prometheus.io/docs/alerting/latest/configuration/#email_configs>`_
in the Prometheus documentation.

****************
Alerts Reference
****************

``etcdDatabaseHighFragmentationRatio``
    This alert is triggered when the etcd database has a high fragmentation
    ratio, which can cause performance issues on the cluster. In order to
    resolve this issue, you can use the following command:

    .. code-block:: console

        kubectl -n kube-system exec svc/kube-prometheus-stack-kube-etcd -- \
            etcdctl defrag \
            --cluster \
            --cacert /etc/kubernetes/pki/etcd/ca.crt \
            --key /etc/kubernetes/pki/etcd/server.key \
            --cert /etc/kubernetes/pki/etcd/server.crt

``NodeNetworkMulticast``
    This alert is triggered when a node is receiving large volumes of multicast
    traffic, which can be a sign of a misconfigured network or a malicious
    actor.

    This can result in high CPU usage on the node and can cause the node to
    become unresponsive. It can also be the cause of a very high number of
    software interrupts on the node.

    In order to find the root cause of this issue, you can use the following
    commands:

    .. code-block:: console

        iftop -ni $DEV -f 'multicast and not broadcast'

    With the command above, you're able to see which IP addresses are sending
    the multicast traffic. Once you have the IP address, you can use the
    following command to find the server behind it:

    .. code-block:: console

        openstack server list --all-projects --long -n --ip $IP

``EtcdMembersDown``
    If any alarms are fired from Prometheus for ``etcd`` issues such as
    ``TargetDown``, ``etcdMembersDown``, or ``etcdInsufficientMembers``, it
    could be due to expired certificates. You can update the certificates that
    ``kube-prometheus-stack`` uses to talk to ``etcd`` with the following
    commands:

    .. code-block:: console

        kubectl -n monitoring delete secret/kube-prometheus-stack-etcd-client-cert
        kubectl -n monitoring create secret generic kube-prometheus-stack-etcd-client-cert \
            --from-file=/etc/kubernetes/pki/etcd/ca.crt \
            --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
            --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key