#####################
Troubleshooting Guide
#####################

This document aims to provide solutions to common issues encountered during the deployment and operation of Atmosphere. The guide is organized by component and issue type to help you quickly find the most relevant information.

**************************
Open Virtual Network (OVN)
**************************

Recovering clusters
===================

If any of the OVN database pods fail, they will no longer be ready. You can
recover the cluster by deleting the pods and allowing them to be recreated.

For example, if the ``ovn-ovsdb-nb-0`` pod fails, you can recover the cluster by
deleting the pod:

.. code-block:: console

   $ kubectl -n openstack delete pods/ovn-ovsdb-nb-0

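Once the pod has been recreated, you can confirm that it is ready again. A
quick check, assuming the northbound database pods carry a
``component=ovn-ovsdb-nb`` label that mirrors the southbound label used below:

.. code-block:: console

   $ kubectl -n openstack get pods -l component=ovn-ovsdb-nb
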
If the entire cluster fails, you can recover the cluster by deleting all of the
pods. For example, if the southbound database fails, you can recover the
cluster with this command:

.. code-block:: console

   $ kubectl -n openstack delete pods -lcomponent=ovn-ovsdb-sb

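To verify that the southbound cluster has reformed after the pods come back,
you can query the RAFT status from inside one of the database pods. This is a
sketch which assumes the standard OVN control socket path inside the container:

.. code-block:: console

   $ kubectl -n openstack exec ovn-ovsdb-sb-0 -- \
       ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
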
If the state of Neutron is lost from the cluster, you can recover it by running
the repair command:

.. code-block:: console

   $ kubectl -n openstack exec deploy/neutron-server -- \
       neutron-ovn-db-sync-util \
       --debug \
       --config-file /etc/neutron/neutron.conf \
       --config-file /tmp/pod-shared/ovn.ini \
       --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
       --ovn-neutron_sync_mode repair

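If you want to preview what the synchronization would change before repairing
anything, the same utility can be run with ``--ovn-neutron_sync_mode log``,
which only reports the inconsistencies it finds:

.. code-block:: console

   $ kubectl -n openstack exec deploy/neutron-server -- \
       neutron-ovn-db-sync-util \
       --debug \
       --config-file /etc/neutron/neutron.conf \
       --config-file /tmp/pod-shared/ovn.ini \
       --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
       --ovn-neutron_sync_mode log
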
**********************
Compute Service (Nova)
**********************

Provisioning Failure Due to ``downloading`` Volume
==================================================

If you're trying to provision a new instance backed by a volume whose backend
needs to download images directly from Glance (such as PowerStore) and the
build fails with the following error:

.. code-block:: text

   Build of instance 54a41735-a4cb-4312-b812-52e4f3d8c500 aborted: Volume 728bdc40-fc22-4b65-b6b6-c94ee7f98ff0 did not finish being created even after we waited 187 seconds or 61 attempts. And its status is downloading.

This means that the volume service could not download the image before the
compute service timed out. Out of the box, Atmosphere ships with the volume
cache enabled to help offset this issue. However, if you're using a backend
that does not support the volume cache, you can increase the timeout by setting
the following in your ``inventory/group_vars/all/nova.yml`` file:

.. code-block:: yaml

   nova_helm_values:
     conf:
       enable_iscsi: true
       nova:
         DEFAULT:
           block_device_allocate_retries: 300

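Each retry is spaced by ``block_device_allocate_retries_interval``, which
defaults to three seconds, so raising the retry count to 300 gives the volume
service roughly fifteen minutes to finish the download. To confirm the new
value was rendered into the compute configuration, you can grep for it inside
a compute pod; this sketch assumes the default ``nova-compute`` DaemonSet name
and configuration path used by the chart:

.. code-block:: console

   $ kubectl -n openstack exec daemonset/nova-compute -- \
       grep block_device_allocate_retries /etc/nova/nova.conf
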
*******************************
Load Balancer Service (Octavia)
*******************************

Accessing Amphorae
==================

Atmosphere configures an SSH keypair which allows you to log in to the Amphorae
for debugging purposes. The ``octavia-worker`` containers are fully configured
to allow you to SSH to the Amphorae.

If you have an Amphora running with the IP address ``172.24.0.148``, you can
log in to it by executing the following:

.. code-block:: console

   $ kubectl -n openstack exec -it deploy/octavia-worker -- ssh 172.24.0.148

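If you don't already know the management address of the Amphora, you can look
it up with the Octavia CLI; note that listing Amphorae generally requires
admin credentials, and ``${LOAD_BALANCER_ID}`` below is a placeholder:

.. code-block:: console

   $ openstack loadbalancer amphora list --loadbalancer ${LOAD_BALANCER_ID}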

Listener with ``provisioning_status`` stuck in ``ERROR``
========================================================

There are scenarios where the load balancer is in an ``ACTIVE`` state, but the
listener is stuck with a ``provisioning_status`` of ``ERROR``. This is usually
related to an expired TLS certificate that did not recover cleanly.

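You can confirm the stuck state directly from the API; ``${LISTENER_ID}`` below
is a placeholder for the affected listener:

.. code-block:: console

   $ openstack loadbalancer listener show ${LISTENER_ID} \
       -c provisioning_status -c operating_status
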
Another symptom of this issue is the following traceback inside the
``octavia-worker`` logs:

.. code-block:: text

   ERROR oslo_messaging.rpc.server [None req-ad303faf-7a53-4c55-94a5-28cd61c46619 - e83856ceda5c42df8810df42fef8fc1c - - - -] Exception during message handling: octavia.amphorae.drivers.haproxy.exceptions.InternalServerError: Internal Server Error
   ERROR oslo_messaging.rpc.server Traceback (most recent call last):
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
   ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
   ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
   ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/queue/v2/endpoints.py", line 90, in update_pool
   ERROR oslo_messaging.rpc.server self.worker.update_pool(original_pool, pool_updates)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/controller_worker.py", line 733, in update_pool
   ERROR oslo_messaging.rpc.server self.run_flow(
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/controller_worker.py", line 113, in run_flow
   ERROR oslo_messaging.rpc.server tf.run()
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/engine.py", line 247, in run
   ERROR oslo_messaging.rpc.server for _state in self.run_iter(timeout=timeout):
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/engine.py", line 340, in run_iter
   ERROR oslo_messaging.rpc.server failure.Failure.reraise_if_any(er_failures)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/types/failure.py", line 338, in reraise_if_any
   ERROR oslo_messaging.rpc.server failures[0].reraise()
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/types/failure.py", line 350, in reraise
   ERROR oslo_messaging.rpc.server raise value
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/taskflow/engines/action_engine/executor.py", line 52, in _execute_task
   ERROR oslo_messaging.rpc.server result = task.execute(**arguments)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py", line 157, in execute
   ERROR oslo_messaging.rpc.server self.amphora_driver.update(loadbalancer)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 236, in update
   ERROR oslo_messaging.rpc.server self.update_amphora_listeners(loadbalancer, amphora)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 205, in update_amphora_listeners
   ERROR oslo_messaging.rpc.server self.clients[amphora.api_version].upload_config(
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 758, in upload_config
   ERROR oslo_messaging.rpc.server return exc.check_exception(r)
   ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.10/site-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 44, in check_exception
   ERROR oslo_messaging.rpc.server raise responses[status_code]()
   ERROR oslo_messaging.rpc.server octavia.amphorae.drivers.haproxy.exceptions.InternalServerError: Internal Server Error

You can trigger a complete failover of the load balancer, which will resolve
the issue:

.. code-block:: console

   $ openstack loadbalancer failover ${LOAD_BALANCER_ID}

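The failover rebuilds the Amphorae, which also gives them fresh TLS
certificates. Once it completes, you can confirm that the load balancer and the
affected listener have returned to an ``ACTIVE`` provisioning status:

.. code-block:: console

   $ openstack loadbalancer show ${LOAD_BALANCER_ID} -c provisioning_status
   $ openstack loadbalancer listener show ${LISTENER_ID} -c provisioning_status
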
.. admonition:: Help us improve Atmosphere!
   :class: info

   We're trying to collect data on when these failures occur to better
   understand the root cause. If you encounter this issue, please help the
   Atmosphere team by filing an issue that includes the ``amphora-agent`` logs
   from the affected Amphora.