This page lists all monitoring alerts of the Deckhouse Kubernetes Platform.

Alerts are grouped by module. To the right of each alert name are icons showing the minimum DKP edition in which the alert is available and the alert severity level.

Each alert comes with a short summary; expand it to see the detailed description, if one is available.

Module admission-policy-engine

  • D8AdmissionPolicyEngineNotBootstrapped CE S7
    Admission policy engine module hasn't been bootstrapped for 10 minutes.

    Admission policy engine module couldn't bootstrap. Please check that the module's components are up and running: kubectl get pods -n d8-admission-policy-engine. Also, it makes sense to check the relevant logs in case there are missing constraint templates or not all CRDs were created: kubectl logs -n d8-system -lapp=deckhouse --tail=1000 | grep admission-policy-engine

  • OperationPolicyViolation CE S7
    At least one object violates configured cluster Operation Policies.

    You have configured OperationPolicy for the cluster.

    You can find existing objects violating policies by running the count by (violating_namespace, violating_kind, violating_name, violation_msg) (d8_gatekeeper_exporter_constraint_violations{violation_enforcement="deny",source_type="OperationPolicy"}) Prometheus query or via the Admission policy engine Grafana dashboard.

  • PodSecurityStandardsViolation CE S7
    At least one pod violates configured cluster pod security standards.

    You have configured pod security standards (https://kubernetes.io/docs/concepts/security/pod-security-standards/).

    You can find Running pods that violate the standards by running the count by (violating_namespace, violating_name, violation_msg) (d8_gatekeeper_exporter_constraint_violations{violation_enforcement="deny",violating_namespace=~".*",violating_kind="Pod",source_type="PSS"}) Prometheus query or via the Admission policy engine Grafana dashboard.

  • SecurityPolicyViolation CE S7
    At least one object violates configured cluster Security Policies.

    You have configured SecurityPolicy for the cluster.

    You can find existing objects violating policies by running the count by (violating_namespace, violating_kind, violating_name, violation_msg) (d8_gatekeeper_exporter_constraint_violations{violation_enforcement="deny",source_type="SecurityPolicy"}) Prometheus query or via the Admission policy engine Grafana dashboard.
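
    For convenience, the deny-violation query from the alerts above as a standalone Prometheus expression; only the label filters differ between the three alerts (source_type, and additionally violating_kind="Pod" for pod security standards):

    count by (violating_namespace, violating_kind, violating_name, violation_msg) (
      d8_gatekeeper_exporter_constraint_violations{violation_enforcement="deny", source_type="SecurityPolicy"}
    )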

Module cert-manager

  • CertmanagerCertificateExpired CE S4
    Certificate expired

    Certificate {{$labels.exported_namespace}}/{{$labels.name}} expired

  • CertmanagerCertificateExpiredSoon CE S4
    Certificate will expire soon

    The certificate {{$labels.exported_namespace}}/{{$labels.name}} will expire in less than 2 weeks
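
    A quick way to locate and inspect the affected Certificate resource (a sketch; assumes the cert-manager CRD certificates.cert-manager.io is available in the cluster):

    kubectl get certificates.cert-manager.io -A
    kubectl -n {{$labels.exported_namespace}} describe certificate {{$labels.name}}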

  • CertmanagerCertificateOrderErrors CE S5
    Certmanager cannot order a certificate.

    Certmanager receives responses with the code {{ $labels.status }} when requesting {{ $labels.scheme }}://{{ $labels.host }}{{ $labels.path }}.

    This can affect certificate ordering and renewal. Check the cert-manager logs for more info: kubectl -n d8-cert-manager logs -l app=cert-manager -c cert-manager

Module chrony

  • NodeTimeOutOfSync CE S5
    Node's {{$labels.node}} clock is drifting.

    Time on node {{$labels.node}} is out of sync with the NTP server by {{ $value }} seconds.

  • NTPDaemonOnNodeDoesNotSynchronizeTime CE S5
    NTP daemon on node {{$labels.node}} has not synchronized time for too long.
    1. Check if the chrony pod is running on the node by executing the following command:
      • kubectl -n d8-chrony get pods --field-selector spec.nodeName="{{$labels.node}}"
    2. Check the chrony daemon's status by executing the following command (see also the sketch after this list):
      • kubectl -n d8-chrony exec <POD_NAME> -- /opt/chrony-static/bin/chronyc sources
    3. Correct the time synchronization problems:
      • correct network problems:
        • provide availability to upstream time synchronization servers defined in the module configuration;
        • eliminate large packet loss and excessive latency to upstream time synchronization servers.
      • Modify NTP servers list defined in the module configuration.
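
    A rough sketch for checking synchronization status across all chrony pods at once (an illustration; depending on the pod spec you may need to pick the container with -c):

    for pod in $(kubectl -n d8-chrony get pods -o name); do
      echo "=== ${pod} ==="
      kubectl -n d8-chrony exec "${pod}" -- /opt/chrony-static/bin/chronyc tracking
    done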

Module cloud-provider-yandex

  • D8YandexNatInstanceConnectionsQuotaUtilization CE S4
    Yandex nat-instance connections quota utilization is above 85% over the last 5 minutes.

    Nat-instance connections quota should be increased by Yandex technical support.

  • NATInstanceWithDeprecatedAvailabilityZone CE S9
    NAT Instance {{ $labels.name }} is in deprecated availability zone.

    Availability zone ru-central1-c is deprecated by Yandex.Cloud. You should migrate your NAT Instance to ru-central1-a or ru-central1-b zone.

    You can use the following instructions to migrate.

    IMPORTANT The following actions are destructive and cause downtime (typically several tens of minutes, depending on the response time of Yandex Cloud).

    1. Migrate NAT Instance.

      Get providerClusterConfiguration.withNATInstance:

       kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerClusterConfiguration.withNATInstance'
      
      1. If you specified withNATInstance.natInstanceInternalAddress and/or withNATInstance.internalSubnetID in providerClusterConfiguration, you need to remove them with the following command:

         kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
        
      2. If you specified withNATInstance.externalSubnetID and/or withNATInstance.natInstanceExternalAddress in providerClusterConfiguration, you need to change these to the appropriate values.

        You can get the address and subnetID from the Yandex.Cloud console or with the CLI.

        Change withNATInstance.externalSubnetID and withNATInstance.natInstanceExternalAddress with the following command:

         kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
        
    2. Run the appropriate edition and version of the Deckhouse installer container on the local machine (change the container registry address if necessary) and do converge.

      1. Get edition and version of the Deckhouse:

         DH_VERSION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}')
         DH_EDITION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]')
         echo "DH_VERSION=$DH_VERSION DH_EDITION=$DH_EDITION"
        
      2. Run the installer:

         docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" registry.deckhouse.io/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash
        
      3. Do converge:

         dhctl converge --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> --ssh-host <MASTER-NODE-0-HOST>
        
    3. Update route table

      1. Get route table name

         kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.global.clusterConfiguration.cloud.prefix'
        
      2. Get NAT Instance name:

         kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerDiscoveryData.natInstanceName'
        
      3. Get NAT Instance internal IP

         yc compute instance list | grep -e "INTERNAL IP" -e <NAT_INSTANCE_NAME_FROM_PREVIOUS_STEP>
        
      4. Update route

         yc vpc route-table update --name <ROUTE_TABLE_NAME_FROM_PREVIOUS_STEP> --route "destination=0.0.0.0/0,next-hop=<NAT_INSTANCE_INTERNAL_IP_FROM_PREVIOUS_STEP>"
        
  • NodeGroupNodeWithDeprecatedAvailabilityZone CE S9
    NodeGroup {{ $labels.node_group }} contains Nodes with deprecated availability zone.

    Availability zone ru-central1-c is deprecated by Yandex.Cloud. You should migrate your Nodes, Disks and LoadBalancers to ru-central1-a, ru-central1-b or ru-central1-d (introduced in v1.56). To check which Nodes should be migrated, use kubectl get node -l "topology.kubernetes.io/zone=ru-central1-c" command.

    You can use the Yandex Migration Guide (mostly applicable to the ru-central1-d zone only).

    IMPORTANT You cannot migrate public IP addresses between zones. Check out the Yandex Migration Guide for details.

Module cni-cilium

  • CiliumAgentEndpointsNotReady CE S4
    More than half of all known Endpoints are not ready in agent {{ $labels.namespace }}/{{ $labels.pod }}.

    Check what’s going on: kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}

  • CiliumAgentMapPressureCritical CE S4
    eBPF map {{ $labels.map_name }} is more than 90% full in agent {{ $labels.namespace }}/{{ $labels.pod }}.

    The resource limit of eBPF maps has been reached. Consult the vendor for possible remediation steps.

  • CiliumAgentMetricNotFound CE S4
    Some of the metrics are not coming from the agent {{ $labels.namespace }}/{{ $labels.pod }}.

    Use the following commands to check what’s going on:

    • kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}
    • kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod }} cilium-health status

    Cross-check the metrics with a neighboring agent (see the sketch below). Also, the absence of metrics is an indirect sign that new pods cannot be created on the node because of the inability to connect to the agent. A more specific way of detecting that situation, and a more accurate alert for the inability to connect new pods to the agent, are still needed.
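
    A sketch for pulling the agent's own metrics directly, to cross-check against what Prometheus sees (assumes the cilium CLI is available inside the agent container, which is normally the case):

    kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod }} -- cilium metrics list
    kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod }} -- cilium status --verbose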

  • CiliumAgentPolicyImportErrors CE S4
    Agent {{ $labels.namespace }}/{{ $labels.pod }} fails to import policies.

    Check what’s going on: kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}

  • CiliumAgentUnreachableHealthEndpoints CE S4
    Some node's health endpoints are not reachable by agent {{ $labels.namespace }}/{{ $labels.pod }}.

    Check what’s going on: kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}

  • CniCiliumOrphanEgressGatewayPolicyFound EE S4
    Found an orphan EgressGatewayPolicy with an irrelevant EgressGateway name.

    There is an orphan EgressGatewayPolicy named {{$labels.name}} in the cluster that refers to an irrelevant EgressGateway name.

    It is recommended to check the EgressGateway name referenced in the EgressGatewayPolicy resource: {{$labels.egressgateway}}

Module control-plane-manager

  • D8ControlPlaneManagerPodNotRunning CE S6
    Controller Pod not running on Node {{ $labels.node }}

    The d8-control-plane-manager Pod is failing or is not scheduled on Node {{ $labels.node }}.

    Consider checking state of the kube-system/d8-control-plane-manager DaemonSet and its Pods: kubectl -n kube-system get daemonset,pod --selector=app=d8-control-plane-manager

  • D8KubeEtcdDatabaseSizeCloseToTheLimit CE S6
    etcd db size is close to the limit

    The size of the etcd database on {{ $labels.node }} has almost reached the limit. Possibly, a lot of events (e.g., Pod evictions) or a large number of other resources have been created in the cluster recently.

    Possible solutions:

    • You can run defragmentation using the following command: kubectl -n kube-system exec -ti etcd-{{ $labels.node }} -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
    • Increase node memory. Starting from 24 GB of node memory, quota-backend-bytes is increased by 1 GB for every extra 8 GB of memory. For example:
      Node memory: quota-backend-bytes
      16 GB: 2147483648 (2 GB)
      24 GB: 3221225472 (3 GB)
      32 GB: 4294967296 (4 GB)
      40 GB: 5368709120 (5 GB)
      48 GB: 6442450944 (6 GB)
      56 GB: 7516192768 (7 GB)
      64 GB: 8589934592 (8 GB)
      72 GB: 8589934592 (8 GB)
      …

  • D8KubernetesVersionIsDeprecated CE S7
    Kubernetes version "{{ $labels.k8s_version }}" is deprecated

    Current Kubernetes version "{{ $labels.k8s_version }}" is deprecated, and its support will be removed within 6 months.

    Please migrate to the next Kubernetes version (at least 1.27).

    Check how to update the Kubernetes version in the cluster here - https://deckhouse.io/documentation/deckhouse-faq.html#how-do-i-upgrade-the-kubernetes-version-in-a-cluster

  • D8NeedDecreaseEtcdQuotaBackendBytes CE S6
    Deckhouse considers that quota-backend-bytes should be reduced.

    Deckhouse can only increase quota-backend-bytes. This usually happens when the memory of control-plane nodes has been reduced. If that is the case, you should set quota-backend-bytes manually with the controlPlaneManager.etcd.maxDbSize configuration parameter. Before setting a new value, check the current DB usage on every control-plane node:

    for pod in $(kubectl get pod -n kube-system -l component=etcd,tier=control-plane -o name); do kubectl -n kube-system exec -ti "$pod" -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key endpoint status -w json | jq --arg a "$pod" -r '.[0].Status.dbSize / 1024 / 1024 | tostring | $a + ": " + . + " MB"'; done
    

    Recommendations:

    • The maximum value of controlPlaneManager.etcd.maxDbSize is 8 GB.
    • If control-plane nodes have less than 24 GB of memory, use 2 GB for controlPlaneManager.etcd.maxDbSize.
    • For nodes with 24 GB or more, increase the value by 1 GB for every extra 8 GB of memory (a sizing sketch follows this list):
      Node memory: quota-backend-bytes
      16 GB: 2147483648 (2 GB)
      24 GB: 3221225472 (3 GB)
      32 GB: 4294967296 (4 GB)
      40 GB: 5368709120 (5 GB)
      48 GB: 6442450944 (6 GB)
      56 GB: 7516192768 (7 GB)
      64 GB: 8589934592 (8 GB)
      72 GB: 8589934592 (8 GB)
      …
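
    A minimal shell sketch of the sizing rule above (an illustration only, not an official calculator; NODE_MEM_GB is the control-plane node memory in GB):

    NODE_MEM_GB=32   # example value: control-plane node memory in GB
    if [ "$NODE_MEM_GB" -lt 24 ]; then
      QUOTA_GB=2
    else
      QUOTA_GB=$(( 2 + (NODE_MEM_GB - 16) / 8 ))
      [ "$QUOTA_GB" -gt 8 ] && QUOTA_GB=8
    fi
    echo "recommended quota-backend-bytes: $(( QUOTA_GB * 1024 * 1024 * 1024 )) (${QUOTA_GB} GB)"
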
  • KubernetesVersionEndOfLife CE S4
    Kubernetes version "{{ $labels.k8s_version }}" has reached End Of Life.

    Support for the current Kubernetes version "{{ $labels.k8s_version }}" will be removed in the next Deckhouse release (1.58).

    Please migrate to the next Kubernetes version (at least 1.24) as soon as possible.

    Check how to update the Kubernetes version in the cluster here - https://deckhouse.io/documentation/deckhouse-faq.html#how-do-i-upgrade-the-kubernetes-version-in-a-cluster

Module documentation

  • ModuleConfigDeprecated CE S9
    Deprecated ModuleConfig was found.

    The deckhouse-web module was renamed to documentation.

    The new ModuleConfig documentation was generated automatically. Please, remove deprecated ModuleConfig deckhouse-web from the CI deploy process and delete it: kubectl delete mc deckhouse-web.

Module flant-integration

  • D8PrometheusMadisonErrorSendingAlerts BE S5
    Prometheus is unable to deliver 100% alerts.

    Prometheus is unable to deliver 100% alerts.

  • D8PrometheusMadisonErrorSendingAlerts BE S6
    Prometheus is unable to deliver 100% alerts through one or more madison-proxies.

    Prometheus is unable to deliver 100% alerts through one or more madison-proxies.

    You need to check the madison-proxy logs: kubectl -n d8-monitoring logs -f -l app=madison-proxy

  • D8PrometheusMadisonErrorSendingAlertsToBackend BE
    Prometheus is unable to deliver {{ $value | humanizePercentage }} alerts to the {{ $labels.madison_backend }} Madison backend using the {{ $labels.pod }} madison-proxy.

    Prometheus is unable to deliver {{ $value | humanizePercentage }} alerts to the {{ $labels.madison_backend }} Madison backend using the {{ $labels.pod }} madison-proxy.

    You need to check the madison-proxy logs: kubectl -n d8-monitoring logs -f {{ $labels.pod }}

  • FlantPricingNotSendingSamples BE S6
    Flant-pricing cluster metrics are not being delivered

    Succeeded samples metric of the Grafana Agent is not increasing.

    To get more details, check logs of the following containers:

    • kubectl -n d8-flant-integration logs -l app=pricing -c grafana-agent
    • kubectl -n d8-flant-integration logs -l app=pricing -c pricing
  • FlantPricingSucceededSamplesMetricIsAbsent BE S6
    Crucial metrics are missing.

    There is no succeeded-samples metric from the Grafana Agent.

    To get more details:

    Check pods state: kubectl -n d8-flant-integration get pod -l app=pricing or logs: kubectl -n d8-flant-integration logs -l app=pricing -c grafana-agent

Module flow-schema

  • KubernetesAPFRejectRequests CE S9
    APF flow schema d8-serviceaccounts has rejected API requests.

    To show APF schema queue requests, use the expr apiserver_flowcontrol_current_inqueue_requests{flow_schema="d8-serviceaccounts"}.

    Attention: This is an experimental alert!
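
    The expression from the description, plus (as an assumption, not part of the alert text) the standard apiserver metric for requests that were actually rejected:

    apiserver_flowcontrol_current_inqueue_requests{flow_schema="d8-serviceaccounts"}
    apiserver_flowcontrol_rejected_requests_total{flow_schema="d8-serviceaccounts"}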

Module ingress-nginx

  • D8NginxIngressKruiseControllerPodIsRestartingTooOften CE S8
    Too many kruise controller restarts have been detected in d8-ingress-nginx namespace.

    The number of restarts in the last hour: {{ $value }}. Excessive kruise controller restarts indicate that something is wrong. Normally, it should be up and running all the time.

    The recommended course of action:

    1. Check any events regarding kruise-controller-manager in d8-ingress-nginx namespace in case there were some issues there related to the nodes the manager runs on or memory shortage (OOM): kubectl -n d8-ingress-nginx get events | grep kruise-controller-manager
    2. Analyze the controller’s pods’ descriptions to check which containers were restarted and what were the possible reasons (exit codes, etc.): kubectl -n d8-ingress-nginx describe pod -lapp=kruise,control-plane=controller-manager
    3. In case kruise container was restarted, list relevant logs of the container to check if there were some meaningful errors there: kubectl -n d8-ingress-nginx logs -lapp=kruise,control-plane=controller-manager -c kruise
  • DeprecatedGeoIPVersion CE S9
    Deprecated GeoIP version 1 is being used in the cluster.

    There is an IngressNginxController and/or an Ingress object that utilize(s) the Nginx GeoIPv1 module's variables. The module is deprecated and its support is discontinued starting from Ingress Nginx Controller version 1.10. It's recommended to upgrade your configuration to use the GeoIPv2 module. Use the following command to get the list of the IngressNginxControllers that contain GeoIPv1 variables: kubectl get ingressnginxcontrollers.deckhouse.io -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | .metadata.name'

    Use the following command to get the list of the Ingress objects that contain GeoIPv1 variables: kubectl get ingress -A -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | "\(.metadata.namespace)/\(.metadata.name)"' | sort | uniq

  • NginxIngressConfigTestFailed CE S4
    Config test failed on NGINX Ingress {{ $labels.controller }} in the {{ $labels.controller_namespace }} Namespace.

    The configuration testing (nginx -t) of the {{ $labels.controller }} Ingress controller in the {{ $labels.controller_namespace }} Namespace has failed.

    The recommended course of action:

    1. Check the controller's logs: kubectl -n {{ $labels.controller_namespace }} logs {{ $labels.controller_pod }} -c controller;
    2. Find the newest Ingress in the cluster: kubectl get ingress --all-namespaces --sort-by="metadata.creationTimestamp";
    3. Probably, there is an error in configuration-snippet or server-snippet.
  • NginxIngressDaemonSetNotUpToDate CE S9
    There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} Ingress Nginx DaemonSet for the last 20 minutes.

    There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} Ingress Nginx DaemonSet for the last 20 minutes.

    The recommended course of action:

    1. Check the DaemonSet’s status: kubectl -n {{ $labels.namespace }} get ads {{ $labels.daemonset }}
    2. Analyze the DaemonSet’s description: kubectl -n {{ $labels.namespace }} describe ads {{ $labels.daemonset }}
    3. If the Number of Nodes Scheduled with Up-to-date Pods parameter does not match Current Number of Nodes Scheduled, check the pertinent Ingress Nginx Controller's nodeSelector and toleration settings and compare them to the relevant nodes' labels and taints settings.
  • NginxIngressDaemonSetReplicasUnavailable CE S4
    Count of available replicas in NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.

    Count of available replicas in NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.

    List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace=\"%s\", condition!=\"true\"} == 1)) * on (namespace, pod) kube_controller_pod{namespace=\"%s\", controller_type=\"DaemonSet\", controller_name=\"%s\"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}

    This command might help you figure out the problematic nodes, given that you know where the DaemonSet should be scheduled in the first place (using a label selector for pods might help, too):

    kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • NginxIngressDaemonSetReplicasUnavailable CE S6
    Some replicas of NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} are unavailable.

    Some replicas of NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} are unavailable. Currently at: {{ .Value }} unavailable replica(s)

    List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace=\"%s\", condition!=\"true\"} == 1)) * on (namespace, pod) kube_controller_pod{namespace=\"%s\", controller_type=\"DaemonSet\", controller_name=\"%s\"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}

    This command might help you figure out the problematic nodes, given that you know where the DaemonSet should be scheduled in the first place (using a label selector for pods might help, too):

    kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • NginxIngressPodIsRestartingTooOften CE S4
    Too many NGINX Ingress restarts have been detected.

    The number of restarts in the last hour: {{ $value }}. Excessive NGINX Ingress restarts indicate that something is wrong. Normally, it should be up and running all the time.

  • NginxIngressProtobufExporterHasErrors CE S8
    The Ingress Nginx sidecar container with protobuf_exporter has {{ $labels.type }} errors.

    The Ingress Nginx sidecar container with protobuf_exporter has {{ $labels.type }} errors.

    Please, check Ingress controller’s logs: kubectl -n d8-ingress-nginx logs $(kubectl -n d8-ingress-nginx get pods -l app=controller,name={{ $labels.controller }} -o wide | grep {{ $labels.node }} | awk '{print $1}') -c protobuf-exporter.

  • NginxIngressSslExpired CE S4
    Certificate has expired.

    SSL certificate for {{ $labels.host }} in {{ $labels.namespace }} has expired. You can verify the certificate with the kubectl -n {{ $labels.namespace }} get secret {{ $labels.secret_name }} -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates command.

    The https://{{ $labels.host }} version of the site doesn't work!

  • NginxIngressSslWillExpire CE S5
    Certificate expires soon.

    SSL certificate for {{ $labels.host }} in {{ $labels.namespace }} will expire in less than 2 weeks. You can verify the certificate with the kubectl -n {{ $labels.namespace }} get secret {{ $labels.secret_name }} -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates command.

Module istio

  • D8IstioActualDataPlaneVersionNotEqualDesired EE S8
    There are Pods with istio data-plane version {{$labels.version}}, but desired version is {{$labels.desired_version}}

    There are Pods in Namespace {{$labels.namespace}} with istio data-plane version {{$labels.version}}, but the desired one is {{$labels.desired_version}}. Impact — the istio version will change after the Pods are restarted. Cheat sheet:

    ### namespace-wide configuration
    # istio.io/rev=vXYZ — use specific revision
    # istio-injection=enabled — use global revision
    kubectl get ns {{$labels.namespace}} --show-labels
    
    ### pod-wide configuration
    kubectl -n {{$labels.namespace}} get pods -l istio.io/rev={{$labels.desired_revision}}
    
  • D8IstioActualVersionIsNotInstalled EE S4
    control-plane version for Pod with already injected sidecar isn't installed

    There are pods with an injected sidecar of version {{$labels.version}} (revision {{$labels.revision}}) in the {{$labels.namespace}} namespace, but that control-plane version isn't installed. Consider installing it or changing the Namespace or Pod configuration. Impact — Pods have lost their sync with the k8s state. To get the orphaned pods:

    kubectl -n {{ $labels.namespace }} get pods -l 'service.istio.io/canonical-name' -o json | jq --arg revision {{ $labels.revision }} '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
    
  • D8IstioAdditionalControlplaneDoesntWork CE S4
    Additional controlplane doesn't work.

    Additional istio controlplane {{$labels.label_istio_io_rev}} doesn't work. Impact — sidecar injection for Pods with the {{$labels.label_istio_io_rev}} revision doesn't work.

    kubectl get pods -n d8-istio -l istio.io/rev={{$labels.label_istio_io_rev}}
    
  • D8IstioDataPlaneVersionMismatch EE S8
    There are Pods with data-plane version different from control-plane one.

    There are Pods in the {{$labels.namespace}} namespace with istio data-plane version {{$labels.full_version}}, which differs from the control-plane version {{$labels.desired_full_version}}. Consider restarting the affected Pods; use the following PromQL query to get the list:

    max by (namespace, dataplane_pod) (d8_istio_dataplane_metadata{full_version="{{$labels.full_version}}"})
    

    Also consider using the automatic istio data-plane update described in the documentation: https://deckhouse.io/documentation/v1/modules/110-istio/examples.html#upgrading-istio

  • D8IstioDataPlaneWithoutIstioInjectionConfigured EE S4
    There are Pods with istio sidecars, but without istio-injection configured

    There are Pods in {{$labels.namespace}} Namespace with istio sidecars, but the istio-injection isn’t configured. Impact — Pods will lose their istio sidecars after re-creation. Getting affected Pods:

    kubectl -n {{$labels.namespace}} get pods -o json | jq -r --arg revision {{$labels.revision}} '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
    
  • D8IstioDeprecatedIstioVersionInstalled CE
    There is deprecated istio version installed

    There is a deprecated istio version {{$labels.version}} installed. Impact — support for this version will be removed in future Deckhouse releases. The higher the alert severity, the higher the probability that support will be dropped. Upgrading instructions — https://deckhouse.io/documentation//modules/110-istio/examples.html#upgrading-istio.

  • D8IstioDesiredVersionIsNotInstalled EE S6
    Desired control-plane version isn't installed

    The desired istio control-plane version {{$labels.desired_version}} (revision {{$labels.revision}}) is configured for pods in the {{$labels.namespace}} namespace, but that version isn't installed. Consider installing it or changing the Namespace or Pod configuration. Impact — Pods can't be re-created in the {{$labels.namespace}} Namespace. Cheat sheet:

    ### namespace-wide configuration
    # istio.io/rev=vXYZ — use specific revision
    # istio-injection=enabled — use global revision
    kubectl get ns {{$labels.namespace}} --show-labels
    
    ### pod-wide configuration
    kubectl -n {{$labels.namespace}} get pods -l istio.io/rev={{$labels.revision}}
    
  • D8IstioFederationMetadataEndpointDoesntWork EE S6
    Federation metadata endpoint failed

    Metadata endpoint {{$labels.endpoint}} for IstioFederation {{$labels.federation_name}} could not be fetched by the d8 hook. To reproduce the request to the public endpoint:

    curl {{$labels.endpoint}}
    

    Reproducing request to private endpoints (run from deckhouse pod):

    KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
    LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
    REMOTE_CLUSTER_UUID="$(kubectl get istiofederation {{$labels.federation_name}} -o json | jq -r .status.metadataCache.public.clusterUUID)"
    TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-federation --ttl 1h)"
    curl -H "Authorization: Bearer $TOKEN" {{$labels.endpoint}}
    
  • D8IstioGlobalControlplaneDoesntWork CE S4
    Global controlplane doesn't work.

    Global istio controlplane {{$labels.label_istio_io_rev}} doesn't work. Impact — sidecar injection for Pods with the global revision doesn't work, and the validating webhook for istio resources is absent.

    kubectl get pods -n d8-istio -l istio.io/rev={{$labels.label_istio_io_rev}}
    
  • D8IstioMulticlusterMetadataEndpointDoesntWork EE S6
    Multicluster metadata endpoint failed

    Metadata endpoint {{$labels.endpoint}} for IstioMulticluster {{$labels.multicluster_name}} could not be fetched by the d8 hook. To reproduce the request to the public endpoint:

    curl {{$labels.endpoint}}
    

    Reproducing request to private endpoints (run from deckhouse pod):

    KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
    LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
    REMOTE_CLUSTER_UUID="$(kubectl get istiomulticluster {{$labels.multicluster_name}} -o json | jq -r .status.metadataCache.public.clusterUUID)"
    TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-multicluster --ttl 1h)"
    curl -H "Authorization: Bearer $TOKEN" {{$labels.endpoint}}
    
  • D8IstioMulticlusterRemoteAPIHostDoesntWork EE S6
    Multicluster remote api host failed

    Remote API host {{$labels.api_host}} for IstioMulticluster {{$labels.multicluster_name}} has failed the health check performed by the d8 monitoring hook.

    Reproducing (run from deckhouse pod):

    TOKEN="$(deckhouse-controller module values istio -o json | jq -r --arg ah {{$labels.api_host}} '.istio.internal.multiclusters[] | select(.apiHost == $ah) | .apiJWT')"
    curl -H "Authorization: Bearer $TOKEN" https://{{$labels.api_host}}/version
    
  • D8IstioOperatorReconcileError CE S5
    istio-operator is unable to reconcile istio control-plane setup.

    There is an error in the istio-operator reconciliation loop. Please check the logs:

    kubectl -n d8-istio logs -l app=operator,revision={{$labels.revision}}

  • D8IstioPodsWithoutIstioSidecar EE S4
    There are Pods without istio sidecars, but with istio-injection configured

    There is a Pod {{$labels.dataplane_pod}} in {{$labels.namespace}} Namespace without istio sidecars, but the istio-injection is configured. Getting affected Pods:

    kubectl -n {{$labels.namespace}} get pods -l '!service.istio.io/canonical-name' -o json | jq -r '.items[] | select(.metadata.annotations."sidecar.istio.io/inject" != "false") | .metadata.name'
    
  • D8IstioVersionIsIncompatibleWithK8sVersion CE S3
    The installed istio version is incompatible with the k8s version

    The current istio version {{$labels.istio_version}} may not work properly with the current k8s version {{$labels.k8s_version}} because this combination is not officially supported. Please upgrade istio as soon as possible. Upgrading instructions — https://deckhouse.io/documentation//modules/110-istio/examples.html#upgrading-istio.

  • IstioIrrelevantExternalServiceFound CE S5
    Found external service with irrelevant ports spec

    There is a Service named {{$labels.name}} in the {{$labels.namespace}} namespace with an irrelevant ports spec. .spec.ports[] does not make any sense for Services of type ExternalName, but for external Services with ports istio renders a "0.0.0.0:port" listener that catches all the traffic to that port. This is a problem for services that are not in the istio registry.

    It is recommended to get rid of the ports section (.spec.ports); it is safe to do so. See the sketch below for finding such Services.
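
    A sketch for listing ExternalName Services that still define ports (an illustrative helper, not part of the module):

    kubectl get svc -A -o json \
      | jq -r '.items[]
          | select(.spec.type == "ExternalName" and ((.spec.ports // []) | length > 0))
          | "\(.metadata.namespace)/\(.metadata.name)"'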

Module l2-load-balancer

  • L2LoadBalancerOrphanServiceFound EE S4
    Found orphan service with irrelevant L2LoadBalancer name

    There is an orphan Service named {{$labels.name}} in the {{$labels.namespace}} namespace that references an irrelevant L2LoadBalancer name.

    It is recommended to check the L2LoadBalancer name in the annotations (network.deckhouse.io/l2-load-balancer-name); see the sketch below.
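
    A sketch for listing Services that carry the annotation, so the referenced names can be compared with the existing L2LoadBalancer resources (the l2loadbalancers resource name is an assumption):

    kubectl get svc -A -o json \
      | jq -r '.items[]
          | .metadata.annotations["network.deckhouse.io/l2-load-balancer-name"] as $lb
          | select($lb != null)
          | "\(.metadata.namespace)/\(.metadata.name): \($lb)"'
    kubectl get l2loadbalancers    # resource name assumed; adjust if the CRD differs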

Module log-shipper

  • D8LogShipperAgentNotScheduledInCluster CE S7
    Pods of log-shipper-agent cannot be scheduled in the cluster.

    A number of log-shipper-agents are not scheduled.

    To check the state of the d8-log-shipper/log-shipper-agent DaemonSet:

    kubectl -n d8-log-shipper get daemonsets --selector=app=log-shipper
    

    To check the state of the d8-log-shipper/log-shipper-agent Pods:

    kubectl -n d8-log-shipper get pods --selector=app=log-shipper-agent
    

    The following command might help you figure out the problematic nodes, given that you know where the DaemonSet should be scheduled in the first place:

    kubectl -n d8-log-shipper get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="log-shipper-agent")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • D8LogShipperClusterLogDestinationD8LokiAuthorizationRequired CE S9
    Authorization params are required for the ClusterLogDestination.

    Found ClusterLogDestination resource {{$labels.resource_name}} without authorization params. You should add authorization params to the ClusterLogDestination resource (see the sketch below).
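
    To inspect the offending resource (ClusterLogDestination is cluster-scoped in the log-shipper module):

    kubectl get clusterlogdestination {{$labels.resource_name}} -o yaml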

  • D8LogShipperCollectLogErrors CE S4
    Pods of log-shipper-agent cannot collect logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.

    The {{ $labels.host }} log-shipper agent on the {{ $labels.node }} node has failed to collect metrics for more than 10 minutes. The reason is {{ $labels.error_type }} errors that occurred during the {{ $labels.stage }} stage while reading {{ $labels.component_type }}.

    Consider checking the logs of the pod or following the advanced debug instructions: kubectl -n d8-log-shipper logs {{ $labels.host }} -c vector

  • D8LogShipperDestinationErrors CE S4
    Pods of log-shipper-agent cannot send logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.

    Logs do not reach their destination; the {{ $labels.host }} log-shipper agent on the {{ $labels.node }} node has been unable to send logs for more than 10 minutes. The reason is {{ $labels.error_type }} errors that occurred during the {{ $labels.stage }} stage while sending logs to {{ $labels.component_type }}.

    Consider checking the logs of the pod or following the advanced debug instructions: kubectl -n d8-log-shipper logs {{ $labels.host }} -c vector

  • D8LogShipperLogsDroppedByRateLimit CE S4
    Pods of log-shipper-agent drop logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.

    Rate limit rules are applied; the log-shipper agent on the {{ $labels.node }} node has been dropping logs for more than 10 minutes.

    Consider checking the logs of the pod or following the advanced debug instructions: kubectl -n d8-log-shipper get pods -o wide | grep {{ $labels.node }}

Module metallb

  • D8MetalLBBGPSessionDown SE S4
    MetalLB BGP session down.

    {{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} has BGP session {{ $labels.peer }} down. Details are in logs:

    kubectl -n d8-metallb logs daemonset/speaker -c speaker
    
  • D8MetalLBConfigNotLoaded SE S4
    MetalLB config not loaded.

    {{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} has not loaded the configuration. To figure out the problem, check the controller logs:

    kubectl -n d8-metallb logs deploy/controller -c controller
    
  • D8MetalLBConfigStale SE S4
    MetalLB running on a stale configuration, because the latest config failed to load.

    {{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} is running on a stale configuration because the latest config failed to load. To figure out the problem, check the controller logs:

    kubectl -n d8-metallb logs deploy/controller -c controller
    

Module monitoring-custom

  • D8ReservedNodeLabelOrTaintFound CE S6
    Node {{ $labels.name }} needs fixing up

    Node {{ $labels.name }} uses:

    • the reserved metadata.labels prefix node-role.deckhouse.io/ with a suffix not in (system|frontend|monitoring|_deckhouse_module_name_),
    • or the reserved spec.taints key dedicated.deckhouse.io with a value not in (system|frontend|monitoring|_deckhouse_module_name_).

    Get instructions on how to fix it here.
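
    A rough sketch for finding the offending labels and taints (only the three literal values are checked; names of enabled Deckhouse modules, which are also allowed, are not taken into account):

    # nodes using the reserved node-role.deckhouse.io/ label prefix
    kubectl get nodes --show-labels | grep 'node-role.deckhouse.io/'
    # nodes using the reserved dedicated.deckhouse.io taint key and its values
    kubectl get nodes -o json \
      | jq -r '.items[]
          | select(any(.spec.taints[]?; .key == "dedicated.deckhouse.io"))
          | .metadata.name + ": " + ([.spec.taints[] | select(.key == "dedicated.deckhouse.io") | .value // ""] | join(", "))'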

Module monitoring-deckhouse

  • D8DeckhouseConfigInvalid CE S5
    Deckhouse config is invalid.

    Deckhouse config contains errors.

    Please check Deckhouse logs by running kubectl -n d8-system logs -f -l app=deckhouse.

    Edit the Deckhouse global configuration by running kubectl edit mc global, or the configuration of a specific module by running kubectl edit mc <MODULE_NAME>.

  • D8DeckhouseCouldNotDeleteModule CE S4
    Deckhouse is unable to delete the {{ $labels.module }} module.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotDiscoverModules CE S4
    Deckhouse is unable to discover modules.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunGlobalHook CE S5
    Deckhouse is unable to run the {{ $labels.hook }} global hook.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunModule CE S4
    Deckhouse is unable to start the {{ $labels.module }} module.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunModuleHook CE S7
    Deckhouse is unable to run the {{ $labels.module }}/{{ $labels.hook }} module hook.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCustomTargetDown CE S4
    Prometheus is unable to scrape custom metrics generated by Deckhouse hooks.
  • D8DeckhouseDeprecatedConfigmapManagedByArgoCD CE S4
    Deprecated deckhouse configmap managed by Argo CD

    The deckhouse configmap is no longer used. You need to remove the "d8-system/deckhouse" configmap from Argo CD.

  • D8DeckhouseGlobalHookFailsTooOften CE S9
    The {{ $labels.hook }} Deckhouse global hook crashes way too often.

    The {{ $labels.hook }} hook has failed in the last __SCRAPE_INTERVAL_X_4__.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseHasNoAccessToRegistry CE S7
    Deckhouse is unable to connect to the registry.

    Deckhouse is unable to connect to the registry (registry.deckhouse.io in most cases) to check for a new Docker image (checks are performed every 15 seconds). Deckhouse does not have access to the registry; automatic updates are not available.

    Usually, this alert means that the Deckhouse Pod is having difficulties with connecting to the Internet.

  • D8DeckhouseIsHung CE S4
    Deckhouse is down.

    Deckhouse is probably down since the deckhouse_live_ticks metric in Prometheus is no longer increasing (it is supposed to increment every 10 seconds).

  • D8DeckhouseIsNotOnReleaseChannel CE S9
    Deckhouse in the cluster is not subscribed to one of the regular release channels.

    Deckhouse is on a custom branch instead of one of the regular release channels.

    It is recommended that Deckhouse be subscribed to one of the following channels: Alpha, Beta, EarlyAccess, Stable, RockSolid.

    Use the command below to find out what release channel is currently in use: kubectl -n d8-system get deploy deckhouse -o json | jq '.spec.template.spec.containers[0].image' -r

    Subscribe the cluster to one of the regular release channels.

  • D8DeckhouseModuleHookFailsTooOften CE S9
    The {{ $labels.module }}/{{ $labels.hook }} Deckhouse hook crashes way too often.

    The {{ $labels.hook }} hook of the {{ $labels.module }} module has failed in the last __SCRAPE_INTERVAL_X_4__.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhousePodIsNotReady CE S4
    The Deckhouse Pod is NOT Ready.
  • D8DeckhousePodIsNotRunning CE S4
    The Deckhouse Pod is NOT Running.
  • D8DeckhousePodIsRestartingTooOften CE S9
    Excessive Deckhouse restarts detected.

    The number of restarts in the last hour: {{ $value }}.

    Excessive Deckhouse restarts indicate that something is wrong. Normally, Deckhouse should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseQueueIsHung CE S7
    The {{ $labels.queue }} Deckhouse queue has hung; there are {{ $value }} task(s) in the queue.

    Deckhouse cannot finish processing of the {{ $labels.queue }} queue with {{ $value }} tasks piled up.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseSelfTargetAbsent CE S4
    There is no Deckhouse target in Prometheus.
  • D8DeckhouseSelfTargetDown CE S4
    Prometheus is unable to scrape Deckhouse metrics.
  • D8DeckhouseWatchErrorOccurred CE S5
    Possible apiserver connection error in the client-go informer, check logs and snapshots.

    Error occurred in the client-go informer, possible problems with connection to apiserver.

    Check Deckhouse logs for more information by running: kubectl -n d8-system logs deploy/deckhouse | grep error | grep -i watch

    This alert is an attempt to detect the correlation between the faulty snapshot invalidation and apiserver connection errors, especially for the handle-node-template hook in the node-manager module. Check the difference between the snapshot and actual node objects for this hook: diff -u <(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'|sort) <(kubectl -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller module snapshots node-manager -o json | jq '."040-node-manager/hooks/handle_node_templates.go"' | jq '.nodes.snapshot[] | .filterResult.Name' -r | sort)

  • D8NodeHasDeprecatedOSVersion CE S4
    Nodes have deprecated OS versions.

    Some nodes have deprecated OS versions. Please update the nodes to a supported OS version.

    To observe affected nodes use the expr kube_node_info{os_image=~"Debian GNU/Linux 9.*"} in Prometheus.

  • D8NodeHasDeprecatedOSVersion CE S4
    Nodes have deprecated OS versions.

    Some nodes have deprecated OS versions. Please update the nodes to a supported OS version.

    To observe affected nodes use the expr kube_node_info{os_image=~"Ubuntu 18.04.*"} in Prometheus.

  • D8NodeHasUnmetKernelRequirements CE S4
    Nodes have unmet kernel requirements

    Some nodes have unmet kernel constraints. This means that some modules cannot run on those nodes. Current kernel requirements: for the Cilium module, the kernel should be >= 4.9.17; for Cilium with the Istio, OpenVPN, or Node-local-dns modules, the kernel should be >= 5.7.

    To observe affected nodes use the expr d8_node_kernel_does_not_satisfy_requirements == 1 in Prometheus.
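
    In addition to the Prometheus expression above, node kernel versions can be listed directly (a simple helper, not taken from the alert itself):

    kubectl get nodes -o custom-columns='NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion'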

  • DeckhouseReleaseDisruptionApprovalRequired CE S4
    Deckhouse release disruption approval required.

    Deckhouse release contains disruption update.

    You can figure out more details by running kubectl describe DeckhouseRelease {{ $labels.name }}. If you are ready to deploy this release, run: kubectl annotate DeckhouseRelease {{ $labels.name }} release.deckhouse.io/disruption-approved=true.

  • DeckhouseReleaseIsBlocked CE S5
    Deckhouse release requirements unmet.

    The Deckhouse release requirements are not met.

    Please run kubectl describe DeckhouseRelease {{ $labels.name }} for details.

  • DeckhouseReleaseIsWaitingManualApproval CE S3
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseReleaseIsWaitingManualApproval CE S6
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseReleaseIsWaitingManualApproval CE S9
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseUpdating CE S4
    Deckhouse is being updated.
  • DeckhouseUpdatingFailed CE S4
    Deckhouse update has failed.

    Failed to update Deckhouse.

    The Deckhouse image for the next minor/patch version is not available in the registry, or the image is corrupted. Current version: {{ $labels.version }}.

    Make sure that the next version Deckhouse image is available in the registry.
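
    A quick way to see the state of the release objects (a sketch based on the DeckhouseRelease resources referenced by the alerts above):

    kubectl get deckhousereleases
    kubectl describe deckhouserelease <RELEASE_NAME>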

  • MigrationRequiredFromRBDInTreeProvisionerToCSIDriver CE S9
    Storage class {{ $labels.storageclass }} uses the deprecated rbd provisioner. It is necessary to migrate the volumes to the Ceph CSI driver.

    To migrate the volumes, use this script: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/tools/rbd-in-tree-to-ceph-csi-migration-helper.sh. A description of how the migration is performed can be found here: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/docs/internal/INTREE_MIGRATION.md

Module monitoring-kubernetes-control-plane

  • K8SApiserverDown CE S3
    No API servers are reachable

    No API servers are reachable or all have disappeared from service discovery

  • K8sCertificateExpiration CE S5
    Kubernetes has API clients with soon expiring certificates

    Some clients connect to {{$labels.component}} on node {{$labels.node}} with a certificate that expires soon (in less than 1 day).

    You need to use kubeadm to check control plane certificates.

    1. Install kubeadm: apt install kubeadm=1.24.*.
    2. Check certificates: kubeadm alpha certs check-expiration

    To check kubelet certificates, on each node you need to:

    1. Check kubelet config:
      ps aux \
        | grep "/usr/bin/kubelet" \
        | grep -o -e "--kubeconfig=\S*" \
        | cut -f2 -d"=" \
        | xargs cat
      
    2. Find the client-certificate or client-certificate-data field.
    3. Check the certificate using openssl (see the sketch below the list).
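
    A sketch for step 3, assuming the kubelet kubeconfig is /etc/kubernetes/kubelet.conf and uses client-certificate-data (adjust the path to whatever step 1 printed; for client-certificate, run openssl on the referenced file instead):

      grep 'client-certificate-data' /etc/kubernetes/kubelet.conf \
        | awk '{print $2}' \
        | base64 -d \
        | openssl x509 -noout -enddate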

    There are no tools to help you find other stale kubeconfigs. It is better to enable the control-plane-manager module to make debugging such cases easier.

  • K8sCertificateExpiration CE S6
    Kubernetes has API clients with soon expiring certificates

    Some clients connect to {{$labels.component}} on node {{$labels.node}} with a certificate that expires soon (in less than 7 days).

    You need to use kubeadm to check control plane certificates.

    1. Install kubeadm: apt install kubeadm=1.24.*.
    2. Check certificates: kubeadm alpha certs check-expiration

    To check kubelet certificates, on each node you need to:

    1. Check kubelet config:
      ps aux \
        | grep "/usr/bin/kubelet" \
        | grep -o -e "--kubeconfig=\S*" \
        | cut -f2 -d"=" \
        | xargs cat
      
    2. Find the client-certificate or client-certificate-data field.
    3. Check the certificate using openssl.

    There are no tools to help you find other stale kubeconfigs. It is better to enable the control-plane-manager module to make debugging such cases easier.

  • K8SControllerManagerTargetDown CE S3
    Controller manager is down

    There is no running kube-controller-manager. Deployments and replication controllers are not making progress.

  • K8SSchedulerTargetDown CE S3
    Scheduler is down

    There is no running K8S scheduler. New pods are not being assigned to nodes.

  • KubeEtcdHighFsyncDurations CE S7
    Syncing (fsync) WAL files to disk is slow.

    In the last 15 minutes, the 99th percentile of the fsync duration for WAL files is longer than 0.5 seconds: {{ $value }}.

    Possible causes:

    1. High latency of the disk where the etcd data is located;
    2. High CPU usage on the Node.
  • KubeEtcdHighNumberOfLeaderChanges CE S5
    The etcd cluster re-elects the leader too often.

    There were {{ $value }} leader re-elections for the etcd cluster member running on the {{ $labels.node }} Node in the last 10 minutes.

    Possible causes:

    1. High latency of the disk where the etcd data is located;
    2. High CPU usage on the Node;
    3. Degradation of network connectivity between cluster members in the multi-master mode.
  • KubeEtcdInsufficientMembers CE S4
    There are insufficient members in the etcd cluster; the cluster will fail if one of the remaining members becomes unavailable.

    Check the status of the etcd pods: kubectl -n kube-system get pod -l component=etcd.

  • KubeEtcdNoLeader CE S4
    The etcd cluster member running on the {{ $labels.node }} Node has lost the leader.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd | grep {{ $labels.node }}.

  • KubeEtcdTargetAbsent CE S5
    There is no etcd target in Prometheus.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd or Prometheus logs: kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus

  • KubeEtcdTargetDown CE S5
    Prometheus is unable to scrape etcd metrics.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd or Prometheus logs: kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus.

Module monitoring-ping

  • NodePingPacketLoss CE S4
    Ping loss more than 5%

    ICMP packet loss to node {{$labels.destination_node}} is more than 5%

Module node-manager

  • ClusterHasOrphanedDisks CE S6
    Cloud data discoverer finds disks in the cloud for which there is no PersistentVolume in the cluster

    Cloud data discoverer finds disks in the cloud for which there is no PersistentVolume in the cluster. You can manually delete these disks from your cloud: ID: {{ $labels.id }}, Name: {{ $labels.name }}

  • D8BashibleApiserverLocked CE S6
    Bashible-apiserver is locked for too long

    Check that the bashible-apiserver pods are up-to-date and running: kubectl -n d8-cloud-instance-manager get pods -l app=bashible-apiserver

  • D8CloudDataDiscovererCloudRequestError CE S6
    Cloud data discoverer cannot get data from cloud

    Cloud data discoverer cannot get data from cloud. See cloud data discoverer logs for more information: kubectl -n {{ $labels.namespace }} logs deploy/cloud-data-discoverer

  • D8CloudDataDiscovererSaveError CE S6
    Cloud data discoverer cannot save data to k8s resource

    Cloud data discoverer cannot save data to k8s resource. See cloud data discoverer logs for more information: kubectl -n {{ $labels.namespace }} logs deploy/cloud-data-discoverer

  • D8ClusterAutoscalerManagerPodIsNotReady CE S8
    The {{$labels.pod}} Pod is NOT Ready.
  • D8ClusterAutoscalerPodIsNotRunning CE S8
    The cluster-autoscaler Pod is NOT Running.

    The {{$labels.pod}} Pod is {{$labels.phase}}.

    Run the following command to check its status: kubectl -n {{$labels.namespace}} get pods {{$labels.pod}} -o json | jq .status.

  • D8ClusterAutoscalerPodIsRestartingTooOften CE S9
    Too many cluster-autoscaler restarts have been detected.

    The number of restarts in the last hour: {{ $value }}.

    Excessive cluster-autoscaler restarts indicate that something is wrong. Normally, it should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler.

  • D8ClusterAutoscalerTargetAbsent CE S8
    There is no cluster-autoscaler target in Prometheus.

    Cluster-autoscaler automatically scales Nodes in the cluster; its unavailability will result in the inability to add new Nodes if there is a lack of resources to schedule Pods. In addition, the unavailability of cluster-autoscaler may result in over-spending due to provisioned but inactive cloud instances.

    The recommended course of action:

    1. Check the availability and status of cluster-autoscaler Pods: kubectl -n d8-cloud-instance-manager get pods -l app=cluster-autoscaler
    2. Check whether the cluster-autoscaler deployment is present: kubectl -n d8-cloud-instance-manager get deploy cluster-autoscaler
    3. Check the status of the cluster-autoscaler deployment: kubectl -n d8-cloud-instance-manager describe deploy cluster-autoscaler
  • D8ClusterAutoscalerTargetDown CE S8
    Prometheus is unable to scrape cluster-autoscaler's metrics.
  • D8ClusterAutoscalerTooManyErrors CE S8
    Cluster-autoscaler issues too many errors.

    Cluster-autoscaler’s scaling attempt resulted in an error from the cloud provider.

    Please, refer to the corresponding logs: kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler.

  • D8MachineControllerManagerPodIsNotReady CE S8
    The {{$labels.pod}} Pod is NOT Ready.
  • D8MachineControllerManagerPodIsNotRunning CE S8
    The machine-controller-manager Pod is NOT Running.

    The {{$labels.pod}} Pod is {{$labels.phase}}.

    Run the following command to check the status of the Pod: kubectl -n {{$labels.namespace}} get pods {{$labels.pod}} -o json | jq .status.

  • D8MachineControllerManagerPodIsRestartingTooOften CE S9
    The machine-controller-manager module restarts too often.

    The number of restarts in the last hour: {{ $value }}.

    Excessive machine-controller-manager restarts indicate that something is wrong. Normally, it should be up and running all the time.

    Please, refer to the logs: kubectl -n d8-cloud-instance-manager logs -f -l app=machine-controller-manager -c controller.

  • D8MachineControllerManagerTargetAbsent CE S8
    There is no machine-controller-manager target in Prometheus.

    Machine controller manager manages ephemeral Nodes in the cluster. Its unavailability will result in the inability to add/delete Nodes.

    The recommended course of action:

    1. Check the availability and status of machine-controller-manager Pods: kubectl -n d8-cloud-instance-manager get pods -l app=machine-controller-manager;
    2. Check the availability of the machine-controller-manager Deployment: kubectl -n d8-cloud-instance-manager get deploy machine-controller-manager;
    3. Check the status of the machine-controller-manager Deployment: kubectl -n d8-cloud-instance-manager describe deploy machine-controller-manager.
  • D8MachineControllerManagerTargetDown CE S8
    Prometheus is unable to scrape machine-controller-manager's metrics.
  • D8NodeGroupIsNotUpdating CE S8
    The {{ $labels.node_group }} node group is not handling the update correctly.

    There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, no Node can get approval to start updating.

    Most likely, there is a problem with the update_approval hook of the node-manager module.
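
    If needed, a quick way to look for messages from this hook in the Deckhouse logs (the grep pattern is an assumption, not an exact log format):

    kubectl -n d8-system logs -l app=deckhouse --tail=1000 | grep update_approval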

  • D8NodeIsNotUpdating CE S7
    The {{ $labels.node }} Node cannot complete the update.

    There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime. The update manager (the update_approval hook of the node-manager module) processed the request, and the Node received downtime approval. However, there is no message confirming that the update completed successfully.

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsNotUpdating CE S8
    The {{ $labels.node }} Node cannot complete the update.

    There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, but cannot complete the update.

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsNotUpdating CE S9
    The {{ $labels.node }} Node does not update.

    There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group, but the Node has not received the update and is not trying to.

    Most likely, Bashible is not handling the update correctly for some reason. At this point, it should have added the update.node.deckhouse.io/waiting-for-approval annotation to the Node so that the update can be approved.
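
    To check whether the annotation has appeared on the Node, you can run, for example:

    kubectl get node {{ $labels.node }} -o jsonpath='{.metadata.annotations.update\.node\.deckhouse\.io/waiting-for-approval}'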

    You can find out the most current version of the update using this command:

    kubectl -n d8-cloud-instance-manager get secret configuration-checksums -o jsonpath={.data.{{ $labels.node_group }}} | base64 -d
    

    Use the following command to find out the version on the Node:

    kubectl get node {{ $labels.node }} -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
    

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsUnmanaged CE S9
    The {{ $labels.node }} Node is not managed by the node-manager module.

    The {{ $labels.node }} Node is not managed by the node-manager module.

    The recommended actions are as follows:

    • Follow these instructions to clean up the node and add it to the cluster: http://documentation.example.com/modules/040-node-manager/faq.html#how-to-clean-up-a-node-for-adding-to-the-cluster
  • D8NodeUpdateStuckWaitingForDisruptionApproval CE S8
    The {{ $labels.node }} Node cannot get disruption approval.

    There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, and ran into a stage that causes possible downtime. For some reason, the Node cannot get disruption approval (it is issued fully automatically by the update_approval hook of the node-manager module).

  • D8ProblematicNodeGroupConfiguration CE S8
    The {{ $labels.node }} Node cannot begin the update.

    There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, {{ $labels.node }} Node cannot be updated.

    Node {{ $labels.node }} has no node.deckhouse.io/configuration-checksum annotation. Perhaps the bootstrap process of the Node did not complete correctly. Check the cloud-init logs (/var/log/cloud-init-output.log) of the Node. There is probably a problematic NodeGroupConfiguration resource for {{ $labels.node_group }} NodeGroup.
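
    To verify the annotation and list the NodeGroupConfiguration resources (the full resource name below is an assumption), you can run:

    kubectl get node {{ $labels.node }} -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
    kubectl get nodegroupconfigurations.deckhouse.io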

  • EarlyOOMPodIsNotReady CE S8
    The {{$labels.pod}} Pod has detected that the PSI (Pressure Stall Information) subsystem is unavailable.

    Check the logs for additional information: kubectl -n d8-cloud-instance-manager logs {{$labels.pod}}

    Possible actions to resolve the problem:

    • Upgrade the kernel to version 4.20 or higher.
    • Enable Pressure Stall Information.
    • Disable early OOM.
  • NodeGroupHasStaticInternalNetworkCIDRsField CE S9
    NodeGroup {{ $labels.name }} has the deprecated field spec.static.internalNetworkCIDRs

    The internal network CIDRs setting is now located in the static cluster configuration and has already been migrated there automatically. Delete this field from the {{ $labels.name }} NodeGroup to resolve this alert.
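
    A minimal sketch of removing the field with a JSON patch (assuming the path matches the spec.static.internalNetworkCIDRs field named above):

    kubectl patch ng {{ $labels.name }} --type=json -p '[{"op": "remove", "path": "/spec/static/internalNetworkCIDRs"}]'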

  • NodeGroupMasterTaintIsAbsent CE S4
    The 'master' node group does not contain the desired taint.

    The master node group has no node-role.kubernetes.io/control-plane taint. The control-plane nodes are probably misconfigured and can run Pods other than control-plane ones. Please add:

      nodeTemplate:
        taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
    

    to the master node group spec. The node-role.kubernetes.io/master taint is deprecated and has no effect in Kubernetes 1.24+.
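
    One way to apply the snippet above is to edit the NodeGroup directly and add it under spec:

    kubectl edit ng master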

  • NodeGroupReplicasUnavailable CE S7
    There are no available instances in the {{ $labels.node_group }} node group.

    Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
    2. The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error: kubectl -n d8-cloud-instance-manager get machine;
    3. Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending: kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means that you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }
      
    5. The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
  • NodeGroupReplicasUnavailable CE S8
    The number of simultaneously unavailable instances in the {{ $labels.node_group }} node group exceeds the allowed value.

    Possibly, autoscaler has provisioned too many Nodes. Take a look at the state of the Machine in the cluster. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
    2. The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error: kubectl -n d8-cloud-instance-manager get machine;
    3. Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending: kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means that you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }
      
    5. The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
  • NodeGroupReplicasUnavailable CE S8
    There are unavailable instances in the {{ $labels.node_group }} node group.

    The number of unavailable instances is {{ $value }}. See the relevant alerts for more information. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
    2. The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error: kubectl -n d8-cloud-instance-manager get machine;
    3. Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending: kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means that you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }
      
    5. The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
  • NodeRequiresDisruptionApprovalForUpdate CE S8
    The {{ $labels.node }} Node requires disruption approval to proceed with the update

    There is a new update for Nodes and the {{ $labels.node }} Node of the {{ $labels.node_group }} group has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime.

    You have to manually approve the disruption since the Manual mode is active in the group settings (disruptions.approvalMode).

    Grant approval to the Node using the update.node.deckhouse.io/disruption-approved= annotation if it is ready for unsafe updates (e.g., drained).

    Caution! The Node will not be drained automatically since manual mode is enabled (disruptions.approvalMode: Manual).

    Caution! There is no need to drain master nodes.

    • Use the following commands to drain the Node and grant it update approval:
      kubectl drain {{ $labels.node }} --delete-local-data=true --ignore-daemonsets=true --force=true &&
        kubectl annotate node {{ $labels.node }} update.node.deckhouse.io/disruption-approved=
      
    • Note that you need to uncordon the node after the update is complete (i.e., after removing the update.node.deckhouse.io/approved annotation).
      while kubectl get node {{ $labels.node }} -o json | jq -e '.metadata.annotations | has("update.node.deckhouse.io/approved")' > /dev/null; do sleep 1; done
      kubectl uncordon {{ $labels.node }}
      

    Note that if there are several Nodes in a NodeGroup, you will need to repeat this operation for each Node since only one Node can be updated at a time. Perhaps it makes sense to temporarily enable the automatic disruption approval mode (disruptions.approvalMode: Automatic).

  • NodeStuckInDraining CE S6
    The {{ $labels.node }} Node is stuck in draining.

    The {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup is stuck in draining.

    You can get more info by running: kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=DrainFailed --sort-by='.metadata.creationTimestamp'

    The error message is: {{ $labels.message }}

  • NodeStuckInDrainingForDisruptionDuringUpdate CE S6
    The {{ $labels.node }} Node is stuck in draining.

    There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup. The Node has learned about the update, requested and received approval, started the update, ran into a step that causes possible downtime, and is now stuck in draining while waiting to get disruption approval automatically.

    You can get more info by running: kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=ScaleDown --sort-by='.metadata.creationTimestamp'

Модуль okmeter

  • D8OkmeterAgentPodIsNotReady CE S6
    Okmeter agent is not Ready

Модуль operator-prometheus

  • D8PrometheusOperatorPodIsNotReady CE S7
    The prometheus-operator Pod is NOT Ready.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.
  • D8PrometheusOperatorPodIsNotRunning CE S7
    The prometheus-operator Pod is NOT Running.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.
  • D8PrometheusOperatorTargetAbsent CE S7
    There is no prometheus-operator target in Prometheus.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action is to analyze the deployment information: kubectl -n d8-operator-prometheus describe deploy prometheus-operator.

  • D8PrometheusOperatorTargetDown CE S8
    Prometheus is unable to scrape prometheus-operator metrics.

    The prometheus-operator Pod is not available.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.

Модуль prometheus

  • D8GrafanaDeploymentReplicasUnavailable CE S6
    One or more Grafana Pods are NOT Running.

    The number of Grafana replicas is less than the specified number.

    The Deployment is in the MinimumReplicasUnavailable state.

    Run the following command to check the status of the Deployment: kubectl -n d8-monitoring get deployment grafana -o json | jq .status.

    Run the following command to check the status of the Pods: kubectl -n d8-monitoring get pods -l app=grafana -o json | jq '.items[] | {(.metadata.name):.status}'.

  • D8GrafanaDeprecatedCustomDashboardDefinition CE S9
    The deprecated ConfigMap for defining Grafana dashboards is detected.

    The grafana-dashboard-definitions-custom ConfigMap was found in the d8-monitoring namespace. This means that the deprecated method of registering custom dashboards in Grafana is being used.

    This method is no longer supported. Please use the custom GrafanaDashboardDefinition resource instead.
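
    Before porting the dashboards to GrafanaDashboardDefinition resources, you can review what is currently registered via the deprecated ConfigMap:

    kubectl -n d8-monitoring get configmap grafana-dashboard-definitions-custom -o yaml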

  • D8GrafanaPodIsNotReady CE S6
    The Grafana Pod is NOT Ready.
  • D8GrafanaPodIsRestartingTooOften CE S9
    Excessive Grafana restarts are detected.

    The number of restarts in the last hour: {{ $value }}.

    Excessive Grafana restarts indicate that something is wrong. Normally, Grafana should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-monitoring logs -f -l app=grafana -c grafana.

  • D8GrafanaTargetAbsent CE S6
    There is no Grafana target in Prometheus.

    Grafana visualizes metrics collected by Prometheus. Grafana is critical for some tasks, such as monitoring the state of applications and the cluster as a whole. Additionally, Grafana unavailability can negatively impact users who actively use it in their work.

    The recommended course of action:

    1. Check the availability and status of Grafana Pods: kubectl -n d8-monitoring get pods -l app=grafana;
    2. Check the availability of the Grafana Deployment: kubectl -n d8-monitoring get deployment grafana;
    3. Examine the status of the Grafana Deployment: kubectl -n d8-monitoring describe deployment grafana.
  • D8GrafanaTargetDown CE S6
    Prometheus is unable to scrape Grafana metrics.
  • D8PrometheusLongtermFederationTargetDown CE S5
    prometheus-longterm cannot scrape prometheus.

    prometheus-longterm cannot scrape the /federate endpoint of the main Prometheus. Check the cause of the error in the prometheus-longterm web UI or logs.
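
    For example, the logs can be inspected like this (the Pod and container names are assumptions based on the prometheus-longterm StatefulSet described elsewhere on this page):

    kubectl -n d8-monitoring logs prometheus-longterm-0 -c prometheus --tail=100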

  • D8PrometheusLongtermTargetAbsent CE S7
    There is no prometheus-longterm target in Prometheus.

    This Prometheus component is only used to display historical data and is not crucial. However, if it remains unavailable for long enough, you will not be able to view the statistics.

    Usually, Pods of this type have problems because of disk unavailability (e.g., the disk cannot be mounted to a Node for some reason).

    The recommended course of action:

    1. Take a look at the StatefulSet data: kubectl -n d8-monitoring describe statefulset prometheus-longterm;
    2. Explore its PVC (if used): kubectl -n d8-monitoring describe pvc prometheus-longterm-db-prometheus-longterm-0;
    3. Explore the Pod’s state: kubectl -n d8-monitoring describe pod prometheus-longterm-0.
  • D8TricksterTargetAbsent CE S5
    There is no Trickster target in Prometheus.

    The following modules use this component:

    • prometheus-metrics-adapter — the unavailability of the component means that HPA (auto scaling) is not running and you cannot view resource consumption using kubectl;
    • vertical-pod-autoscaler — this module is quite capable of surviving a short-term unavailability, as VPA looks at the consumption history for 8 days;
    • grafana — by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and, hence, to its unavailability.

    The recommended course of action:

    1. Analyze the Deployment information: kubectl -n d8-monitoring describe deployment trickster;
    2. Analyze the Pod information: kubectl -n d8-monitoring describe pod -l app=trickster;
    3. Usually, Trickster is unavailable due to Prometheus-related issues because the Trickster’s readinessProbe checks the Prometheus availability. Thus, make sure that Prometheus is running: kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main.
  • D8TricksterTargetAbsent CE S5
    There is no Trickster target in Prometheus.

    The following modules use this component:

    • prometheus-metrics-adapter — the unavailability of the component means that HPA (auto scaling) is not running and you cannot view resource consumption using kubectl;
    • vertical-pod-autoscaler — this module is quite capable of surviving a short-term unavailability, as VPA looks at the consumption history for 8 days;
    • grafana — by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and, hence, to unavailability.

    The recommended course of action:

    1. Analyze the Deployment stats: kubectl -n d8-monitoring describe deployment trickster;
    2. Analyze the Pod stats: kubectl -n d8-monitoring describe pod -l app=trickster;
    3. Usually, Trickster is unavailable due to Prometheus-related issues because the Trickster’s readinessProbe checks the Prometheus availability. Thus, make sure that Prometheus is running: kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main.
  • DeckhouseModuleUseEmptyDir CE S9
    Deckhouse module {{ $labels.module_name }} uses emptyDir as storage.

    Deckhouse module {{ $labels.module_name }} uses emptyDir as storage.

  • GrafanaDashboardAlertRulesDeprecated CE S8
    Deprecated Grafana alerts have been found.

    Before updating to Grafana 10, outdated alerts must be migrated from Grafana to an external alertmanager (or the exporter-alertmanager stack). To list all deprecated alert rules, use the following expression: sum by (dashboard, panel, alert_rule) (d8_grafana_dashboards_deprecated_alert_rule) > 0

    Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • GrafanaDashboardPanelIntervalDeprecated CE S8
    Deprecated Grafana panel intervals have been found.

    Before updating to Grafana 10, outdated expressions that use $interval_rv, interval_sx3, or interval_sx4 must be rewritten to use $__rate_interval. To list all deprecated panel intervals, use the following expression: sum by (dashboard, panel, interval) (d8_grafana_dashboards_deprecated_interval) > 0

    Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • GrafanaDashboardPluginsDeprecated CE S8
    Deprecated Grafana plugins have been found.

    Before updating to Grafana 10, check whether the currently installed plugins will work correctly with Grafana 10. To list all potentially outdated plugins, use the following expression: sum by (dashboard, panel, plugin) (d8_grafana_dashboards_deprecated_plugin) > 0

    The flant-statusmap-panel plugin is deprecated and will not be supported in the near future. We recommend migrating to the State Timeline plugin: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/state-timeline/

    Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • K8STooManyNodes CE S7
    The number of nodes is close to the maximum allowed.

    The cluster is running {{ $value }} nodes, which is close to the maximum of {{ print “d8_max_nodes_amount{}” query first value }} nodes.
  • PrometheusDiskUsage CE S4
    Prometheus disk is over 95% used.

    For more information, use the command:

    kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod_name }} -c prometheus -- df -PBG /prometheus
    

    Consider increasing the disk size: https://deckhouse.io/documentation/v1/modules/300-prometheus/faq.html#how-to-expand-disk-size

  • PrometheusLongtermRotatingEarlierThanConfiguredRetentionDays CE S4
    Prometheus-longterm data is being rotated earlier than the configured number of retention days

    You need to increase the disk size, reduce the number of metrics, or decrease the longtermRetentionDays module parameter (see the sketch below).
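
    A sketch of where to change this parameter, assuming module settings are managed via a ModuleConfig resource:

    kubectl edit moduleconfig prometheus
    # then adjust spec.settings.longtermRetentionDays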

  • PrometheusMainRotatingEarlierThanConfiguredRetentionDays CE S4
    Prometheus-main data is being rotated earlier than the configured number of retention days

    You need to increase the disk size, reduce the number of metrics, or decrease the retentionDays module parameter.

  • PrometheusScapeConfigDeclarationDeprecated CE S8
    AdditionalScrapeConfigs from secrets will be deprecated soon

    The old way of describing additional scrape configs via secrets will be deprecated in prometheus-operator > v0.65.1. Please use the ScrapeConfig CRD instead: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/proposals/202212-scrape-config.md
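
    To see which ScrapeConfig resources already exist in the cluster (assuming the ScrapeConfig CRD is installed):

    kubectl get scrapeconfigs.monitoring.coreos.com -A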

  • PrometheusServiceMonitorDeprecated CE S8
    A deprecated Prometheus ServiceMonitor has been found.

    The Kubernetes cluster uses a more advanced discovery mechanism, EndpointSlice. Your ServiceMonitor {{ $labels.namespace }}/{{ $labels.name }} contains relabeling rules based on the old Endpoints mechanism (labels starting with __meta_kubernetes_endpoints_). Support for these Endpoints-based relabeling rules will be removed in the future (Deckhouse release 1.60). Please migrate to EndpointSlice relabeling rules. To do this, modify the ServiceMonitor by changing the following labels (a search command is sketched after the mapping):

    __meta_kubernetes_endpoints_name -> __meta_kubernetes_endpointslice_name
    __meta_kubernetes_endpoints_label_XXX -> __meta_kubernetes_endpointslice_label_XXX
    __meta_kubernetes_endpoints_labelpresent_XXX -> __meta_kubernetes_endpointslice_labelpresent_XXX
    __meta_kubernetes_endpoints_annotation_XXX -> __meta_kubernetes_endpointslice_annotation_XXX
    __meta_kubernetes_endpoints_annotationpresent_XXX -> __meta_kubernetes_endpointslice_annotationpresent_XXX
    __meta_kubernetes_endpoint_node_name -> __meta_kubernetes_endpointslice_endpoint_topology_kubernetes_io_hostname
    __meta_kubernetes_endpoint_ready -> __meta_kubernetes_endpointslice_endpoint_conditions_ready
    __meta_kubernetes_endpoint_port_name -> __meta_kubernetes_endpointslice_port_name
    __meta_kubernetes_endpoint_port_protocol -> __meta_kubernetes_endpointslice_port_protocol
    __meta_kubernetes_endpoint_address_target_kind -> __meta_kubernetes_endpointslice_address_target_kind
    __meta_kubernetes_endpoint_address_target_name -> __meta_kubernetes_endpointslice_address_target_name
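
    A simple way to locate ServiceMonitors that still reference the old Endpoints-based labels (a sketch; it only greps the rendered manifests):

    kubectl get servicemonitors.monitoring.coreos.com -A -o yaml | grep -n '__meta_kubernetes_endpoint'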
    
  • TargetDown CE S5
    Target is down

    {{ $labels.job }} target is down.

  • TargetDown CE S6
    Target is down

    {{ $labels.job }} target is down.

  • TargetDown CE S7
    Target is down

    {{ $labels.job }} target is down.

  • TargetSampleLimitExceeded CE S6
    Scrapes are exceeding the sample limit

    The target is down because the sample limit has been exceeded.

  • TargetSampleLimitExceeded CE S7
    The sample limit is close to being exceeded.

    The target is close to exceeding the sample limit; less than 10% remains before the limit is reached.

Модуль runtime-audit-engine

  • D8RuntimeAuditEngineNotScheduledInCluster EE S4
    Pods of runtime-audit-engine cannot be scheduled in the cluster.

    Some runtime-audit-engine Pods are not scheduled, so the security audit is not fully operational.

    Check the state of the d8-runtime-audit-engine/runtime-audit-engine DaemonSet:

    kubectl -n d8-runtime-audit-engine get daemonset,pod --selector=app=runtime-audit-engine

    Get a list of nodes that have Pods in a not Ready state:

    kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    

Модуль secret-copier

  • D8SecretCopierDeprecatedLabels CE S9
    Obsolete antiopa_secret_copier=yes label has been found.

    The secrets copier module has changed the service label for the original secrets in the default namespace.

    Support for the old antiopa-secret-copier: "yes" label will be dropped soon.

    You have to replace the antiopa-secret-copier: "yes" label with secret-copier.deckhouse.io/enabled: "" for all secrets that the secret-copier module uses in the default namespace.
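
    A sketch of swapping the labels on a single secret (the secret name is a placeholder; repeat for every affected secret):

    kubectl -n default label secret <secret_name> antiopa-secret-copier- secret-copier.deckhouse.io/enabled=""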

Модуль snapshot-controller

  • D8SnapshotControllerPodIsNotReady CE S8
    The snapshot-controller Pod is NOT Ready.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-controller
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
  • D8SnapshotControllerPodIsNotRunning CE S8
    The snapshot-controller Pod is NOT Running.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-controller
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
  • D8SnapshotControllerTargetAbsent CE S8
    There is no snapshot-controller target in Prometheus.

    The recommended course of action:

    1. Check the Pod status: kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
    2. Or check the Pod logs: kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
  • D8SnapshotControllerTargetDown CE S8
    Prometheus cannot scrape the snapshot-controller metrics.

    The recommended course of action:

    1. Check the Pod status: kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
    2. Or check the Pod logs: kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
  • D8SnapshotValidationWebhookPodIsNotReady CE S8
    The snapshot-validation-webhook Pod is NOT Ready.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook
  • D8SnapshotValidationWebhookPodIsNotRunning CE S8
    The snapshot-validation-webhook Pod is NOT Running.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook

Модуль upmeter

  • D8SmokeMiniNotBoundPersistentVolumeClaims CE S9
    Smoke-mini has unbound or lost persistent volume claims.

    {{ $labels.persistentvolumeclaim }} persistent volume claim status is {{ $labels.phase }}.

    There is a problem with PV provisioning. Check the status of the PVC to find the problem: kubectl -n d8-upmeter get pvc {{ $labels.persistentvolumeclaim }}

    If there is no disk provisioning system in the cluster, you can disable ordering volumes for smoke-mini through the module settings.
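
    A sketch of where these settings live, assuming module settings are managed via a ModuleConfig resource (check the module documentation for the exact parameter that disables smoke-mini volumes):

    kubectl edit moduleconfig upmeter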

  • D8UpmeterAgentPodIsNotReady CE S6
    Upmeter agent is not Ready
  • D8UpmeterAgentReplicasUnavailable CE S6
    One or more Upmeter agent Pods are NOT Running

    Check DaemonSet status: kubectl -n d8-upmeter get daemonset upmeter-agent -o json | jq .status

    Check the status of its pod: kubectl -n d8-upmeter get pods -l app=upmeter-agent -o json | jq '.items[] | {(.metadata.name):.status}'

  • D8UpmeterProbeGarbageConfigmap CE S9
    Garbage produced by basic probe is not being cleaned.

    Probe configmaps found.

    Upmeter agents should clean ConfigMaps produced by control-plane/basic probe. There should not be more configmaps than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the configmaps were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "basic-functionality") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the ConfigMaps manually:

    kubectl -n d8-upmeter delete cm -l heritage=upmeter

  • D8UpmeterProbeGarbageDeployment CE S9
    Garbage produced by controller-manager probe is not being cleaned.

    Average probe deployments count per upmeter-agent pod: {{ $value }}.

    Upmeter agents should clean Deployments produced by control-plane/controller-manager probe. There should not be more deployments than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the deployments were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional, kube-controller-manager in particular.

    3. Delete the Deployments manually:

    kubectl -n d8-upmeter delete deploy -l heritage=upmeter

  • D8UpmeterProbeGarbageNamespaces CE S9
    Garbage produced by namespace probe is not being cleaned.

    Average probe namespace per upmeter-agent pod: {{ $value }}.

    Upmeter agents should clean namespaces produced by control-plane/namespace probe. There should not be more of these namespaces than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the namespaces were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "namespace") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the namespaces manually: kubectl -n d8-upmeter delete ns -l heritage=upmeter

  • D8UpmeterProbeGarbagePods CE S9
    Garbage produced by scheduler probe is not being cleaned.

    Average probe pods count per upmeter-agent pod: {{ $value }}.

    Upmeter agents should clean Pods produced by control-plane/scheduler probe. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the pods were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "scheduler") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the Pods manually:

    kubectl -n d8-upmeter delete po -l upmeter-probe=scheduler

  • D8UpmeterProbeGarbagePodsFromDeployments CE S9
    Garbage produced by controller-manager probe is not being cleaned.

    Average probe pods count per upmeter-agent pod: {{ $value }}.

    Upmeter agents should clean Deployments produced by control-plane/controller-manager probe, and hence kube-controller-manager should clean their pods. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver or kube-controller-manager. Or, probably, the pods were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional, kube-controller-manager in particular.

    3. Delete the Pods manually:

    kubectl -n d8-upmeter delete po -l upmeter-probe=controller-manager

  • D8UpmeterProbeGarbageSecretsByCertManager CE S9
    Garbage produced by cert-manager probe is not being cleaned.

    Probe secrets found.

    Upmeter agents should clean up the probe certificates, and thus the secrets produced by cert-manager should be cleaned up, too. There should not be more secrets than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver, or cert-manager, or upmeter itself. It is also possible, that the secrets were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "cert-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane and cert-manager are functional.

    3. Delete the certificates manually, and the secrets, if needed:

    kubectl -n d8-upmeter delete certificate -l upmeter-probe=cert-manager
    kubectl -n d8-upmeter get secret -ojson | jq -r '.items[] | .metadata.name' | grep upmeter-cm-probe | xargs -n 1 -- kubectl -n d8-upmeter delete secret
    
  • D8UpmeterServerPodIsNotReady CE S6
    Upmeter server is not Ready
  • D8UpmeterServerPodIsRestartingTooOften CE S9
    Upmeter server is restarting too often.

    Restarts for the last hour: {{ $value }}.

    Upmeter server should not restart too often. It should always be running and collecting episodes. Check its logs to find the problem: kubectl -n d8-upmeter logs -f upmeter-0 upmeter

  • D8UpmeterServerReplicasUnavailable CE S6
    One or more Upmeter server Pods are NOT Running

    Check StatefulSet status: kubectl -n d8-upmeter get statefulset upmeter -o json | jq .status

    Check the status of its pod: kubectl -n d8-upmeter get pods upmeter-0 -o json | jq .status

  • D8UpmeterSmokeMiniMoreThanOnePVxPVC CE S9
    Unnecessary smoke-mini volumes in cluster

    PV count per smoke-mini PVC: {{ $value }}.

    Smoke-mini PVs should be deleted when released. Probably, the smoke-mini storage class has the Retain reclaim policy by default, or there is a CSI/cloud issue.

    These PVs contain no valuable data and should be deleted.

    The list of PVs: kubectl get pv | grep disk-smoke-mini.
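
    To check whether the storage class indeed keeps released volumes, you can inspect the reclaim policy:

    kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIMPOLICY:.reclaimPolicy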

  • D8UpmeterTooManyHookProbeObjects CE S9
    Too many UpmeterHookProbe objects in cluster

    Average UpmeterHookProbe count per upmeter-agent pod is {{ $value }}, but should be strictly 1.

    Some of the objects were left by old upmeter-agent pods due to Upmeter update or downscale.

    Once the cause has been investigated, leave only the newest objects corresponding to the upmeter-agent pods.

    See kubectl get upmeterhookprobes.deckhouse.io.

Модуль user-authn

  • D8DexAllTargetsDown CE S6
    Prometheus is unable to scrape Dex metrics.