#########################
Monitoring and Operations
#########################

Atmosphere ships with a Grafana deployment that includes a few default
dashboards, and a Prometheus deployment that collects metrics from the cluster
and sends alerts to AlertManager. In addition, Loki is deployed to store logs
collected from the cluster by Vector.

******************************
Philosophy and Alerting Levels
******************************

Atmosphere's monitoring philosophy is strongly aligned with the principles
outlined in the Google Site Reliability Engineering (SRE) book. Our approach
focuses on alerting on conditions that are symptomatic of issues which directly
impact the service or system health, rather than simply monitoring the state of
individual components.

Alerting Philosophy
===================

Our alerting philosophy aims to alert the right people at the right time. Most
alerts, if they are affecting a single system, would trigger a lower priority
level (P4 or P5). However, if an issue is affecting the entire control plane of
a specific service, it might escalate to a P3 or P2. And if the whole service
is unavailable, it becomes a P1.

We believe in minimizing alert noise to ensure that alerts are meaningful and
actionable. Our goal is to have every alert provide enough information to
initiate an immediate and effective response, regardless of business hours for
high priority alerts.

We continue to refine our monitoring and alerting strategies to ensure that we
are effectively identifying and responding to incidents. The ultimate goal is
to provide a reliable and high-quality service to all our users.

Severity Levels
===============

Our alerting system classifies incidents into different severity levels based on
their impact on the system and users.

**P1**: Critical
  This level is used for incidents causing a complete service disruption or
  significant loss of functionality across the entire Atmosphere platform.
  Immediate response, attention, and action are necessary regardless of
  business hours.

**P2**: High
  This level is for incidents that affect a large group of users or critical
  system components. These incidents require swift attention and action,
  regardless of business hours, but do not cause a total disruption.

**P3**: Moderate
  This level is for incidents that affect a smaller group of users or a single
  system. These incidents require attention and may necessitate action during
  business hours.

**P4**: Low
  This level is used for minor issues that have a limited impact on a small
  subset of users or system functionality. These incidents require attention
  and action, if necessary, during standard business hours.

**P5**: Informational
  This is the lowest level of severity, used for providing information about
  normal system activities or minor issues that don't significantly impact
  users or system functionality. These incidents typically do not require
  immediate attention or action and are addressed during standard business
  hours.
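
Which of these priorities an alert ultimately maps to is driven by the
``severity`` label carried by the underlying Prometheus alerting rule (for
example, the OpsGenie integration shown later in this document maps
``critical``, ``warning`` and ``info`` to P1, P3 and P5 respectively). As a
hedged sketch only, a custom rule carrying such a label could be added through
the chart's ``additionalPrometheusRulesMap`` value; the rule name, expression
and job below are placeholders rather than rules shipped with Atmosphere:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     additionalPrometheusRulesMap:
       custom-rules:
         groups:
           - name: custom.rules
             rules:
               # Hypothetical example: fire when the example service target
               # has been down for 5 minutes.
               - alert: ExampleServiceDown
                 expr: up{job="example-service"} == 0
                 for: 5m
                 labels:
                   severity: critical
                 annotations:
                   message: "The example-service control plane is unreachable."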

**********************
Operational Procedures
**********************

Creating silences
=================

In order to create a silence, you'll need to log in as an admin user to the
Grafana instance that is deployed as part of Atmosphere.

1. Click on the hamburger menu in the top left corner and select "Alerting"
   and then "Silences" from the menu.

   .. image:: images/monitoring-silences-menu.png
      :alt: Silences menu
      :width: 200

2. Ensure that you select "AlertManager" in the top right corner of the page.
   This makes sure that you create the silence inside the AlertManager that is
   managed by the Prometheus operator instead of the built-in Grafana
   AlertManager, which is not used.

   .. image:: images/monitoring-alertmanger-list.png
      :alt: AlertManager list
      :width: 200

   .. admonition:: AlertManager selection
      :class: warning

      It's important that you select the AlertManager that is managed by the
      Prometheus operator, otherwise your silence will not be applied to the
      Prometheus instance that is deployed as part of Atmosphere.

3. Click the "Add Silence" button and use the AlertManager format to create
   your silence, which you can test by seeing if it matches any alerts in the
   list labeled "Affected alert instances".

   .. admonition:: Limit the number of labels
      :class: info

      It is important to limit the number of labels that you use in your silence
      to ensure that it will continue to work even if the alerts are modified.

      For example, if you have an alert that is labeled with the following labels:

      - ``alertname``
      - ``instance``
      - ``job``
      - ``severity``

      You should only use the ``alertname`` and ``severity`` labels in your
      silence to ensure that it will continue to work even if the ``instance``
      or ``job`` labels are modified. A minimal sketch of such a silence follows
      this procedure.
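
For reference, the silence created through this form is stored by AlertManager
as a small object containing only the matchers, a time window, an author and a
comment. The sketch below shows what an equivalent silence could look like when
expressed as YAML in the shape of the AlertManager v2 API payload; the alert
name, time window and comment are placeholders:

.. code-block:: yaml

   # Sketch of an AlertManager v2 silence, shown as YAML for readability.
   # Only the alertname and severity labels are matched, per the advice above.
   matchers:
     - name: alertname
       value: ExampleServiceDown
       isRegex: false
       isEqual: true
     - name: severity
       value: critical
       isRegex: false
       isEqual: true
   startsAt: "2024-01-01T00:00:00Z"
   endsAt: "2024-01-01T04:00:00Z"
   createdBy: "operator@example.com"
   comment: "Planned maintenance on the example service"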

**************
Configurations
**************

Dashboard Management
====================

For Grafana, rather than enabling persistence through the application's user
interface or manual Helm chart modifications, dashboards should be managed
directly via the Helm chart values.

.. admonition:: Avoid Manual Persistence Configurations!
   :class: warning

   It is important to avoid manual persistence configurations, especially for
   services like Grafana, where dashboards and data sources can be saved. Such
   practices are not captured in version control and pose a risk of data loss,
   configuration drift, and upgrade complications.

To manage Grafana dashboards through Helm, you can include the dashboard
definitions within your configuration file. By doing so, you facilitate
version-controlled dashboard configurations that can be replicated across
different deployments without manual intervention.

For example, a dashboard can be defined in the Helm values like this:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     grafana:
       dashboards:
         default:
           my-dashboard:
             gnetId: 10000
             revision: 1
             datasource: Prometheus

This instructs Helm to fetch and configure the specified dashboard from
`Grafana.com dashboards <https://grafana.com/grafana/dashboards/>`_, using
Prometheus as the data source. You can find more examples of how to do
this in the Grafana Helm chart `Import Dashboards <https://github.com/grafana/helm-charts/tree/main/charts/grafana#import-dashboards>`_
documentation.
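
Dashboards that are not published on Grafana.com can also be embedded inline,
since the Grafana chart accepts raw JSON under the same ``dashboards`` key. The
following is a minimal sketch only; the dashboard name and JSON body are
placeholders, not a dashboard shipped with Atmosphere:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     grafana:
       dashboards:
         default:
           # Hypothetical inline dashboard; replace the JSON with a real
           # dashboard definition exported from Grafana.
           my-inline-dashboard:
             json: |
               {
                 "title": "My Inline Dashboard",
                 "panels": []
               }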

************
Viewing data
************

There are a few different ways to view the data that is collected by the
monitoring stack. The most common ways are through AlertManager, Grafana, and
Prometheus.

Grafana dashboard
=================

By default, an ``Ingress`` is created for Grafana using the
``kube_prometheus_stack_grafana_host`` variable. Authentication is handled by
the Keycloak service, which is deployed by default.

Inside Keycloak, there are three client roles that are created for Grafana:

``grafana:admin``
  Has access to all organization resources, including dashboards, users, and
  teams.

``grafana:editor``
  Can view and edit dashboards, folders, and playlists.

``grafana:viewer``
  Can view dashboards, playlists, and query data sources.

You can view the existing dashboards by going to *Manage* > *Dashboards*. You
can also check any alerts that are currently firing by going to *Alerting* >
*Alerts*.

Prometheus
==========

By default, Prometheus is exposed behind an ``Ingress`` using the
``kube_prometheus_stack_prometheus_host`` variable. In addition, it runs
behind the `oauth2-proxy` service, which handles authentication so that only
authenticated users can access the Prometheus UI.

Alternative Authentication
--------------------------

It is possible to bypass the `oauth2-proxy` service and use an alternative
authentication method to access the Prometheus UI. In both cases, we
override the ``servicePort`` on the ``Ingress`` to point to the port where
Prometheus is running rather than the `oauth2-proxy` service.

.. admonition:: Advanced Usage Only
   :class: warning

   It's strongly recommended that you keep the `oauth2-proxy` service in
   front of the Prometheus UI. The `oauth2-proxy` service is responsible for
   authenticating users and ensuring that only authenticated users can access
   the Prometheus UI.

Basic Authentication
~~~~~~~~~~~~~~~~~~~~

If you want to rely on basic authentication to access the Prometheus UI instead
of using the `oauth2-proxy` service to expose it over single sign-on, you can
do so by making the following changes to your inventory:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     prometheus:
       ingress:
         servicePort: 8080
         annotations:
           nginx.ingress.kubernetes.io/auth-type: basic
           nginx.ingress.kubernetes.io/auth-secret: basic-auth-secret-name

In the example above, we are using the ``basic-auth-secret-name`` secret to
authenticate users. The secret must be created in the same namespace as the
Prometheus deployment, as described in the `Ingress NGINX Annotations <https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#annotations>`_ documentation.
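
As an illustration only, the referenced secret could be created from the output
of ``htpasswd``; with ingress-nginx's default ``auth-file`` secret type, the
htpasswd entries are expected under a key named ``auth``. The namespace, user
name and hash below are placeholders:

.. code-block:: yaml

   apiVersion: v1
   kind: Secret
   metadata:
     name: basic-auth-secret-name
     # Assumption: Prometheus is deployed in the "monitoring" namespace.
     namespace: monitoring
   type: Opaque
   stringData:
     # Placeholder value; generate a real entry with: htpasswd -nb <user> <password>
     auth: "admin:$apr1$exampleexample$replacewithrealhash"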

IP Whitelisting
~~~~~~~~~~~~~~~

If you want to whitelist specific IPs to access the Prometheus UI, you can do
so by making the following changes to your inventory:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     prometheus:
       ingress:
         servicePort: 8080
         annotations:
           nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/24,172.10.0.1"

In the example above, we are whitelisting the IP range ``10.0.0.0/24`` and the
IP address ``172.10.0.1``.

AlertManager
============

By default, the AlertManager dashboard is exposed through an ``Ingress`` at the
host defined by the Ansible variable ``kube_prometheus_stack_alertmanager_host``.
It sits behind the `oauth2-proxy` service and is protected by Keycloak, similar
to Prometheus.
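
All three endpoints described in this section are driven by Ansible inventory
variables. The following is a minimal sketch with placeholder hostnames; use
the hostnames appropriate for your environment:

.. code-block:: yaml

   # Placeholder hostnames for the monitoring endpoints.
   kube_prometheus_stack_grafana_host: grafana.cloud.example.com
   kube_prometheus_stack_prometheus_host: prometheus.cloud.example.com
   kube_prometheus_stack_alertmanager_host: alertmanager.cloud.example.com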

************
Integrations
************

Since Atmosphere relies on AlertManager to send alerts, it is possible to
integrate it with services like OpsGenie, PagerDuty, email and more. To
receive monitoring alerts using your preferred notification tools, you'll
need to integrate them with AlertManager.

OpsGenie
========

In order to get started, you will need to complete the following steps inside
OpsGenie:

1. Create an integration inside OpsGenie by going to *Settings* >
   *Integrations* > *Add Integration* and selecting *Prometheus*.
2. Copy the API key that is generated for you and set up the correct assignment
   rules inside OpsGenie.
3. Create a new heartbeat inside OpsGenie by going to *Settings* >
   *Heartbeats* > *Create Heartbeat*. Set the interval to 1 minute.

Afterwards, you can configure the following options for the Atmosphere config,
making sure that you replace the placeholders with the correct values:

``API_KEY``
  The API key that you copied from the OpsGenie integration.

``HEARTBEAT_NAME``
  The name of the heartbeat that you created inside OpsGenie.

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     alertmanager:
       config:
         receivers:
           - name: "null"
           - name: notifier
             opsgenie_configs:
               - api_key: API_KEY
                 message: >-
                   {% raw -%}
                   {{ .GroupLabels.alertname }}
                   {%- endraw %}
                 priority: >-
                   {% raw -%}
                   {{- if eq .GroupLabels.severity "critical" -}}
                   P1
                   {{- else if eq .GroupLabels.severity "warning" -}}
                   P3
                   {{- else if eq .GroupLabels.severity "info" -}}
                   P5
                   {{- else -}}
                   {{ .GroupLabels.severity }}
                   {{- end -}}
                   {%- endraw %}
                 description: |-
                   {% raw -%}
                   {{ if gt (len .Alerts.Firing) 0 -}}
                   Alerts Firing:
                   {{ range .Alerts.Firing }}
                   - Message: {{ .Annotations.message }}
                     Labels:
                   {{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
                   {{ end }} Annotations:
                   {{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
                   {{ end }} Source: {{ .GeneratorURL }}
                   {{ end }}
                   {{- end }}
                   {{ if gt (len .Alerts.Resolved) 0 -}}
                   Alerts Resolved:
                   {{ range .Alerts.Resolved }}
                   - Message: {{ .Annotations.message }}
                     Labels:
                   {{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
                   {{ end }} Annotations:
                   {{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
                   {{ end }} Source: {{ .GeneratorURL }}
                   {{ end }}
                   {{- end }}
                   {%- endraw %}
           - name: heartbeat
             webhook_configs:
               - url: https://api.opsgenie.com/v2/heartbeats/HEARTBEAT_NAME/ping
                 send_resolved: false
                 http_config:
                   basic_auth:
                     password: API_KEY

Once this is done and deployed, you'll start to see alerts inside OpsGenie and
you can also verify that the heartbeat is listed as *ACTIVE*.

PagerDuty
=========

To integrate with PagerDuty, you first need to prepare an *Integration key*. In
order to do that, you must decide how you want to integrate with PagerDuty,
since there are two ways to do it:

**Event Orchestration**
  This method is beneficial if you want to build different routing rules based
  on the events coming from the integrated tool.

**PagerDuty Service Integration**
  This method is beneficial if you don't need to route alerts from the integrated
  tool to different responders based on the event payload.

For both of these methods, you need to create an *Integration key* in PagerDuty
using the `PagerDuty Integration Guide <https://www.pagerduty.com/docs/guides/prometheus-integration-guide/>`_.

Once you're done, you'll need to configure the inventory with the following
options:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     alertmanager:
       config:
         receivers:
           - name: notifier
             pagerduty_configs:
               - service_key: '<your integration key here>'

You can find more details about
`pagerduty_configs <https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config>`_
in the Prometheus documentation.

Email
=====

To integrate with email, you need to configure the following options in the
inventory:

.. code-block:: yaml

   kube_prometheus_stack_helm_values:
     alertmanager:
       config:
         receivers:
           - name: notifier
             email_configs:
               - smarthost: 'smtp.gmail.com:587'
                 auth_username: '<your email id here>'
                 auth_password: '<your email password here>'
                 from: '<your email id here>'
                 to: "<receiver's email id here>"
                 headers:
                   subject: 'Prometheus Mail Alerts'

You can find more details about
`email_configs <https://prometheus.io/docs/alerting/latest/configuration/#email_configs>`_
in the Prometheus documentation.

****************
Alerts Reference
****************

``etcdDatabaseHighFragmentationRatio``
  This alert is triggered when the etcd database has a high fragmentation ratio,
  which can cause performance issues on the cluster. In order to resolve this
  issue, you can use the following command:

  .. code-block:: console

     kubectl -n kube-system exec svc/kube-prometheus-stack-kube-etcd -- \
       etcdctl defrag \
       --cluster \
       --cacert /etc/kubernetes/pki/etcd/ca.crt \
       --key /etc/kubernetes/pki/etcd/server.key \
       --cert /etc/kubernetes/pki/etcd/server.crt

``NodeNetworkMulticast``
  This alert is triggered when a node is receiving large volumes of multicast
  traffic, which can be a sign of a misconfigured network or a malicious actor.

  This can result in high CPU usage on the node and can cause the node to become
  unresponsive. It can also cause a very high number of software interrupts on
  the node.

  In order to find the root cause of this issue, you can use the following
  commands:

  .. code-block:: console

     iftop -ni $DEV -f 'multicast and not broadcast'

  With the command above, you're able to see which IP addresses are sending the
  multicast traffic. Once you have the IP address, you can use the following
  command to find the server behind it:

  .. code-block:: console

     openstack server list --all-projects --long -n --ip $IP

``EtcdMembersDown``
  If any alarms fire from Prometheus for ``etcd`` issues such as ``TargetDown``,
  ``etcdMembersDown``, or ``etcdInsufficientMembers``, it could be due to expired
  certificates. You can update the certificates that ``kube-prometheus-stack``
  uses for talking to ``etcd`` with the following commands:

  .. code-block:: console

     kubectl -n monitoring delete secret/kube-prometheus-stack-etcd-client-cert
     kubectl -n monitoring create secret generic kube-prometheus-stack-etcd-client-cert \
       --from-file=/etc/kubernetes/pki/etcd/ca.crt \
       --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
       --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key