#####################
Troubleshooting Guide
#####################

This document aims to provide solutions to common issues encountered during the deployment and operation of Atmosphere. The guide is organized by component and issue type to help you quickly find the most relevant information.

**************************
Open Virtual Network (OVN)
**************************

Recovering clusters
===================

If any of the OVN database pods fail, they will no longer be ready. You can
recover the cluster by deleting the pods and allowing them to be recreated.

For example, if the ``ovn-ovsdb-nb-0`` pod fails, you can recover the cluster by
deleting the pod:

.. code-block:: console

   $ kubectl -n openstack delete pods/ovn-ovsdb-nb-0

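Once the pod has been recreated, you can confirm that it is ready again. A
quick check, assuming the northbound database pods carry a
``component=ovn-ovsdb-nb`` label that mirrors the southbound label used below:

.. code-block:: console

   $ kubectl -n openstack get pods -l component=ovn-ovsdb-nb
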
If the entire cluster fails, you can recover the cluster by deleting all of the
pods. For example, if the southbound database fails, you can recover the
cluster with this command:

.. code-block:: console

   $ kubectl -n openstack delete pods -lcomponent=ovn-ovsdb-sb

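To verify that the southbound cluster has reformed after the pods come back,
you can query the RAFT status from inside one of the database pods. This is a
sketch which assumes the standard OVN control socket path inside the container:

.. code-block:: console

   $ kubectl -n openstack exec ovn-ovsdb-sb-0 -- \
       ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
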
If the state of Neutron is lost from the cluster, you can recover it by running
the repair command:

.. code-block:: console

   $ kubectl -n openstack exec deploy/neutron-server -- \
       neutron-ovn-db-sync-util \
       --debug \
       --config-file /etc/neutron/neutron.conf \
       --config-file /tmp/pod-shared/ovn.ini \
       --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
       --ovn-neutron_sync_mode repair

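If you want to preview what the synchronization would change before repairing
anything, the same utility can be run with ``--ovn-neutron_sync_mode log``,
which only reports the inconsistencies it finds:

.. code-block:: console

   $ kubectl -n openstack exec deploy/neutron-server -- \
       neutron-ovn-db-sync-util \
       --debug \
       --config-file /etc/neutron/neutron.conf \
       --config-file /tmp/pod-shared/ovn.ini \
       --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
       --ovn-neutron_sync_mode log
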
**********************
Compute Service (Nova)
**********************

Provisioning Failure Due to ``downloading`` Volume
==================================================

If you're trying to provision a new instance backed by a volume whose backend
needs to download images directly from Glance (such as PowerStore) and the
build fails with the following error:

.. code-block:: text

   Build of instance 54a41735-a4cb-4312-b812-52e4f3d8c500 aborted: Volume 728bdc40-fc22-4b65-b6b6-c94ee7f98ff0 did not finish being created even after we waited 187 seconds or 61 attempts. And its status is downloading.

This means that the volume service could not download the image before the
compute service timed out. Out of the box, Atmosphere ships with the volume
cache enabled to help offset this issue. However, if you're using a backend
that does not support the volume cache, you can increase the timeout by setting
the following in your ``inventory/group_vars/all/nova.yml`` file:

.. code-block:: yaml

   nova_helm_values:
     conf:
       enable_iscsi: true
       nova:
         DEFAULT:
           block_device_allocate_retries: 300

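Each retry is spaced by ``block_device_allocate_retries_interval``, which
defaults to three seconds, so raising the retry count to 300 gives the volume
service roughly fifteen minutes to finish the download. To confirm the new
value was rendered into the compute configuration, you can grep for it inside
a compute pod; this sketch assumes the default ``nova-compute`` DaemonSet name
and configuration path used by the chart:

.. code-block:: console

   $ kubectl -n openstack exec daemonset/nova-compute -- \
       grep block_device_allocate_retries /etc/nova/nova.conf
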
*******************************
Load Balancer Service (Octavia)
*******************************

Accessing Amphorae
==================

Atmosphere configures an SSH keypair which allows you to log in to the Amphorae
for debugging purposes. The ``octavia-worker`` containers are fully configured
to allow you to SSH to the Amphorae.

If you have an Amphora running with the IP address ``172.24.0.148``, you can
log in to it by executing the following:

.. code-block:: console

   $ kubectl -n openstack exec -it deploy/octavia-worker -- ssh 172.24.0.148

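If you don't already know the management address of the Amphora, you can look
it up with the Octavia CLI; note that listing Amphorae generally requires
admin credentials, and ``${LOAD_BALANCER_ID}`` below is a placeholder:

.. code-block:: console

   $ openstack loadbalancer amphora list --loadbalancer ${LOAD_BALANCER_ID}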

Listener with ``provisioning_status`` stuck in ``ERROR``
========================================================

There are scenarios where the load balancer is in an ``ACTIVE`` state, but the
listener is stuck with a ``provisioning_status`` of ``ERROR``. This is usually
related to an expired TLS certificate that did not recover cleanly.

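You can confirm the stuck state directly from the API; ``${LISTENER_ID}`` below
is a placeholder for the affected listener:

.. code-block:: console

   $ openstack loadbalancer listener show ${LISTENER_ID} \
       -c provisioning_status -c operating_status
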
Another symptom of this issue is the following traceback inside the
``octavia-worker`` logs:

.. code-block:: text

   ERROR oslo_messaging.rpc.server [None req-ad303faf-7a53-4c55-94a5-28cd61c46619 - e83856ceda5c42df8810df42fef8fc1c - - - -] Exception during message handling: octavia.amphorae.drivers.haproxy.exceptions.InternalServerError: Internal Server Error
   ERROR oslo_messaging.rpc.server Traceback (most recent call last):
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
   ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
   ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
   ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/queue/v2/endpoints.py", line 90, in update_pool
   ERROR oslo_messaging.rpc.server self.worker.update_pool(original_pool, pool_updates)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/controller_worker.py", line 733, in update_pool
   ERROR oslo_messaging.rpc.server self.run_flow(
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/controller_worker.py", line 113, in run_flow
   ERROR oslo_messaging.rpc.server tf.run()
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/engine.py", line 247, in run
   ERROR oslo_messaging.rpc.server for _state in self.run_iter(timeout=timeout):
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/engine.py", line 340, in run_iter
   ERROR oslo_messaging.rpc.server failure.Failure.reraise_if_any(er_failures)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/types/failure.py", line 338, in reraise_if_any
   ERROR oslo_messaging.rpc.server failures[0].reraise()
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/types/failure.py", line 350, in reraise
   ERROR oslo_messaging.rpc.server raise value
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/executor.py", line 52, in _execute_task
   ERROR oslo_messaging.rpc.server result = task.execute(**arguments)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py", line 157, in execute
   ERROR oslo_messaging.rpc.server self.amphora_driver.update(loadbalancer)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 236, in update
   ERROR oslo_messaging.rpc.server self.update_amphora_listeners(loadbalancer, amphora)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 205, in update_amphora_listeners
   ERROR oslo_messaging.rpc.server self.clients[amphora.api_version].upload_config(
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 758, in upload_config
   ERROR oslo_messaging.rpc.server return exc.check_exception(r)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 44, in check_exception
   ERROR oslo_messaging.rpc.server raise responses[status_code]()
   ERROR oslo_messaging.rpc.server octavia.amphorae.drivers.haproxy.exceptions.InternalServerError: Internal Server Error

You can trigger a complete failover of the load balancer, which will resolve
the issue:

.. code-block:: console

   $ openstack loadbalancer failover ${LOAD_BALANCER_ID}

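The failover rebuilds the Amphorae, which also gives them fresh TLS
certificates. Once it completes, you can confirm that the load balancer and the
affected listener have returned to an ``ACTIVE`` provisioning status:

.. code-block:: console

   $ openstack loadbalancer show ${LOAD_BALANCER_ID} -c provisioning_status
   $ openstack loadbalancer listener show ${LISTENER_ID} -c provisioning_status
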
.. admonition:: Help us improve Atmosphere!
   :class: info

   We're trying to collect data on when these failures occur to better
   understand the root cause. If you encounter this issue, please help the
   Atmosphere team by filing an issue that includes the ``amphora-agent`` logs
   from the affected Amphora.