Канал/версия:

Справка
Список алертов

Список алертов

На странице приведен список алертов мониторинга в Deckhouse Kubernetes Platform.

Список не содержит алертов подключаемых модулей.

Алерты сгруппированы по модулям. Справа от названия алерта указаны иконки минимальной редакции DKP в которой доступен алерт и уровня критичности алерта (severity).

Для каждого алерта приведено краткое описание (summary), раскрыв которое можно увидеть подробное описание алерта (description), при их наличии.

Критичность алерта

В описании алертов присутствует параметр Severity (S), означающий уровень критичности. Его значение варьируется от S1 до S9 и может расцениваться следующим образом:

S1 — максимальный уровень, критический сбой/авария (требуются незамедлительные действия);
S2 — высокий уровень, близкий к максимальному, возможная авария (необходимо быстрое реагирование);
S3 — средний уровень, потенциально серьёзная проблема (необходима проверка);
S4-S9 — низкий уровень. Есть проблема, но в целом работоспособность не нарушена.

Модуль admission-policy-engine

D8AdmissionPolicyEngineNotBootstrapped CE S7
Admission-policy-engine module hasn't been bootstrapped for 10 minutes.
The admission-policy-engine module couldn’t bootstrap.

Steps to troubleshoot:
1. Verify that the module’s components are up and running:
  
  d8 k get pods -n d8-admission-policy-engine
2. Check logs for issues, such as missing constraint templates or incomplete CRD creation:
  
  d8 k logs -n d8-system -lapp=deckhouse --tail=1000 | grep admission-policy-engine
OperationPolicyViolation CE S3
At least one object violates the configured cluster operation policies.
You have configured operation policies for the cluster, and one or more existing objects are violating these policies.

To identify violating objects:
- Run the following Prometheus query:
  
  count by (violating_namespace, violating_kind, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", source_type="OperationPolicy" } )
- Alternatively, check the admission-policy-engine Grafana dashboard.
PodSecurityStandardsViolation CE S3
At least one pod violates the configured cluster pod security standards.
You have configured Pod Security Standards, and one or more running pods are violating these standards.

To identify violating pods:
- Run the following Prometheus query:
  
  count by (violating_namespace, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", violating_namespace=~".*", violating_kind="Pod", source_type="PSS" } )
- Alternatively, check the admission-policy-engine Grafana dashboard.
SecurityPolicyViolation CE S3
At least one object violates the configured cluster security policies.
You have configured security policies for the cluster, and one or more existing objects are violating these policies.

To identify violating objects:
- Run the following Prometheus query:
  
  count by (violating_namespace, violating_kind, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", source_type="SecurityPolicy" } )
- Alternatively, check the admission-policy-engine Grafana dashboard.

Модуль cert-manager

CertmanagerCertificateExpired CE S4
Certificate NAMESPACE/NAME is not provided.
Certificate is not provided.

To check the certificate details, run the following command:
d8 k -n NAMESPACE describe certificate NAME
CertmanagerCertificateExpiredSoon CE S4
Certificate will expire soon.
The certificate NAMESPACE/NAME will expire in less than two weeks.

To check the certificate details, run the following command:
d8 k -n NAMESPACE describe certificate CERTIFICATE-NAME
CertmanagerCertificateOrderErrors CE S5
Cert-manager couldn't order a certificate.
Cert-manager received responses with the status code STATUS_REFERENCE when requesting SCHEME://HOST_NAMEPATH.

This can affect certificate ordering and prolongation in the future. For details, check the cert-manager logs using the following command:
d8 k -n d8-cert-manager logs -l app=cert-manager -c cert-manager

Модуль chrony

NodeTimeOutOfSync CE S5
Clock on the node NODE_NAME is drifting.
Time on the node NODE_NAME is out of sync and drifts apart from the NTP server clock by VALUE seconds.

To resolve the time synchronization issues:
- Fix network errors:
  
  Ensure the upstream time synchronization servers defined in the chrony configuration are available.
  
  Eliminate large packet loss and excessive latency to upstream time synchronization servers.
- Modify the NTP servers list defined in the chrony configuration.
NTPDaemonOnNodeDoesNotSynchronizeTime CE S5
NTP daemon on the node NODE_NAME haven't synchronized time for too long.
Steps to troubleshoot:
1. Check if the chrony pod is running on the node by executing the following command:
  
  d8 k -n d8-chrony --field-selector spec.nodeName="NODE_NAME" get pods
2. Verify the chrony daemon’s status by executing the following command:
  
  d8 k -n d8-chrony exec POD_NAME -- /opt/chrony-static/bin/chronyc sources
3. Resolve the time synchronization issues:
  
  Fix network errors:
  
  Ensure the upstream time synchronization servers defined in the chrony configuration are available.
  
  Eliminate large packet loss and excessive latency to upstream time synchronization servers.
  
  Modify the NTP servers list defined in the chrony configuration.

Модуль cloud-provider-yandex

D8YandexNatInstanceConnectionsQuotaUtilization CE S4

Connection quota utilization of the Yandex NAT instance exceeds 85% over the last 5 minutes.

The connection quota for the Yandex NAT instance has exceeded 85% utilization over the past 5 minutes.

To prevent potential issues, contact Yandex technical support and request an increase in the connection quota.

NATInstanceWithDeprecatedAvailabilityZone CE S9

NAT instance NAME is in a deprecated availability zone.

The NAT instance NAME is located in the availability zone ru-central1-c, which has been deprecated by Yandex Cloud. To resolve this issue, migrate the NAT instance to either ru-central1-a or ru-central1-b by following the instructions below.

The migration process involves irreversible changes and may result in a significant downtime. The duration typically depends on Yandex Cloud’s response time and can last several tens of minutes.

Migrate the NAT instance. To get providerClusterConfiguration.withNATInstance, run the following command:
```
d8 k -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerClusterConfiguration.withNATInstance'
```
- If you specified withNATInstance.natInstanceInternalAddress and/or withNATInstance.internalSubnetID in providerClusterConfiguration, remove them using the following command:
```
d8 system edit provider-cluster-configuration
```
- If you specified withNATInstance.externalSubnetID and/or withNATInstance.natInstanceExternalAddress in providerClusterConfiguration, change them to the appropriate values.
  
  To get the address and subnet ID, use the Yandex Cloud Console or CLI.
  
  To change withNATInstance.externalSubnetID and withNATInstance.natInstanceExternalAddress, run the following command:
```
d8 system edit provider-cluster-configuration
```

Run the appropriate edition and version of the Deckhouse installer container on the local machine. You may have to change the container registry address to do that. After that, perform the converge.

Get the appropriate edition and version of Deckhouse:

DH_VERSION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}')
DH_EDITION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]')
echo "DH_VERSION=$DH_VERSION DH_EDITION=$DH_EDITION"

Run the installer:

docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" registry.deckhouse.io/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash

Perform the converge:

dhctl converge --ssh-agent-private-keys=/tmp/.ssh/SSH_KEY_FILENAME --ssh-user=USERNAME --ssh-host MASTER-NODE-0-HOST

Update the route table.

Get the route table name:

d8 k -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.global.clusterConfiguration.cloud.prefix'

Get the NAT instance name:

d8 k -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerDiscoveryData.natInstanceName'

Get the NAT instance internal IP address:

yc compute instance list | grep -e "INTERNAL IP" -e NAT_INSTANCE_NAME_FROM_PREVIOUS_STEP

Update the route:

yc vpc route-table update --name ROUTE_TABLE_NAME_FROM_PREVIOUS_STEP --route "destination=0.0.0.0/0,next-hop=NAT_INSTANCE_INTERNAL_IP_FROM_PREVIOUS_STEP"

NodeGroupNodeWithDeprecatedAvailabilityZone CE S9
NodeGroup NODE_GROUP_NAME contains nodes in a deprecated availability zone.
Certain nodes in the node group NODE_GROUP_NAME are located in the availability zone ru-central1-c, which has been deprecated by Yandex Cloud.

Steps to troubleshoot:
1. Identify the nodes that need to be migrated by running the following command:
  
  d8 k get node -l "topology.kubernetes.io/zone=ru-central1-c"
2. Migrate your nodes, disks, and load balancers to one of the supported zones: ru-central1-a, ru-central1-b, or ru-central1-d. Refer to the Yandex migration guide for detailed instructions.
  
  You can’t migrate public IP addresses between zones. For details, refer to the migration guide.

Модуль cni-cilium

CiliumAgentEndpointsNotReady CE S4
Over 50% of all known endpoints aren't ready in agent NAMESPACE/POD_NAME.
For details, refer to the logs of the agent:
d8 k -n NAMESPACE logs POD_NAME
CiliumAgentMapPressureCritical CE S4

eBPF map MAP_NAME exceeds 90% utilization in agent NAMESPACE/POD_NAME.

The eBPF map resource utilization limit has almost been reached.

Check with the vendor for potential remediation steps.
CiliumAgentMetricNotFound CE S4
Agent NAMESPACE/POD_NAME isn't sending some metrics.
Steps to troubleshoot:
1. Check the logs of the agent:
  
  d8 k -n NAMESPACE logs POD_NAME
2. Verify the agent’s health status:
  
  d8 k -n NAMESPACE exec -ti POD_NAME cilium-health status
3. Compare the metrics with those of a neighboring agent.
Note that the absence of metrics can indirectly indicate that new pods can’t be created on the node due to connectivity issues with the agent.
CiliumAgentPolicyImportErrors CE S4
Agent NAMESPACE/POD_NAME fails to import policies.
For details, refer to the logs of the agent:
d8 k -n NAMESPACE logs POD_NAME
CiliumAgentUnreachableHealthEndpoints CE S4
Agent NAMESPACE/POD_NAME can't reach some of the node health endpoints.
For details, refer to the logs of the agent:
d8 k -n NAMESPACE logs POD_NAME
CniCiliumNonStandardVXLANPortFound CE S4

Cilium configuration uses a non-standard VXLAN port.

The Cilium configuration specifies a non-standard VXLAN port XXX. The recommended port number is XXX.

To resolve this issue, update the tunnel-port parameter in the cilium-configmap ConfigMap located in the d8-cni-cilium namespace to match the recommended port.

If you configured the non-standard port on purpose, ignore this alert.
CniCiliumOrphanEgressGatewayPolicyFound SE-PLUS S4

Orphaned EgressGatewayPolicy with an irrelevant EgressGateway name has been found.

The cluster contains an orphaned EgressGatewayPolicy named NAME with an irrelevant EgressGateway name.

To resolve this issue, verify the EgressGateway name specified in the EgressGatewayPolicy resource EGRESSGATEWAY and update it as needed.
D8NodeHasUnmetKernelRequirements CE S4
Nodes have unmet kernel requirements.
Deckhouse has detected that some nodes don’t meet the required kernel constraints. As a result, certain modules can’t run on these nodes.

Kernel requirements for each module:
- Cilium module: Kernel version must be >= 5.8.
To list all affected nodes, use the d8_node_kernel_does_not_satisfy_requirements == 1 expression in Prometheus.

Модуль control-plane-manager

D8ControlPlaneManagerPodNotRunning CE S6
Controller pod isn't running on node NODE_NAME.
The d8-control-plane-manager pod is either failing or hasn’t been scheduled on node NODE_NAME.

To resolve this issue, check the status of the kube-system/d8-control-plane-manager DaemonSet and its pods by running the following command:
d8 k -n kube-system get daemonset,pod --selector=app=d8-control-plane-manager
D8EtcdDatabaseHighFragmentationRatio CE S7
etcd database size in use is less than 50% of the allocated storage.
The etcd database size in use on instance INSTANCE_NAME is less than 50% of the allocated disk space, indicating potential fragmentation. Additionally, the total storage size exceeds 75% of the configured quota.

To resolve this issue, defragment the etcd database by running the following command:
d8 k -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \ --cacert /etc/kubernetes/pki/etcd/ca.crt \ --cert /etc/kubernetes/pki/etcd/ca.crt \ --key /etc/kubernetes/pki/etcd/ca.key \ --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
D8EtcdExcessiveDatabaseGrowth CE S4

etcd cluster database is growing rapidly.

Based on the growth rate observed over the last six hours, Deckhouse predicts that the etcd database will run out of disk space within one day on instance INSTANCE_NAME.

To prevent disruptions, investigate the cause and take necessary action.

D8KubeEtcdDatabaseSizeCloseToTheLimit CE S3

etcd database size is approaching the limit.

The etcd database size on NODE_NAME is nearing its size limit. This may be caused by a high number of events, such as pod evictions or the recent creation of numerous resources in the cluster.

Possible solutions:

Defragment the etcd database by running the following command:

d8 k -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/ca.crt \
  --key /etc/kubernetes/pki/etcd/ca.key \
  --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s

Increase node memory. Starting from 24 GB, quota-backend-bytes will increase by 1 GB for every extra 8 GB of memory.

Example:

Node memory	quota-backend-bytes
16 GB	2147483648 (2 GB)
24 GB	3221225472 (3 GB)
32 GB	4294967296 (4 GB)
40 GB	5368709120 (5 GB)
48 GB	6442450944 (6 GB)
56 GB	7516192768 (7 GB)
64 GB	8589934592 (8 GB)
72 GB	8589934592 (8 GB)
…	…

D8KubernetesStaleTokensDetected CE S8
Stale service account tokens detected.
This issue may occur if an application reads the token only at startup and does not reload it periodically. As a result, an outdated token might be used, leading to security breach and authentication failures.

Recommended actions:
- Ensure your application is configured to periodically reload the token from the file system.
- Verify that you are using an up-to-date client library that supports automatic token rotation.
Note that currently these tokens are not blocked because the --service-account-extend-token-expiration flag is enabled by default (Default: true). With this flag enabled, admission-injected tokens are extended up to 1 year during token generation to facilitate a safe transition from legacy tokens to the bound service account token feature, ignoring the value of service-account-max-token-expiration.

For further investigation: Log into the server with label instance=INSTANCE_NAME and inspect the audit log using the following command:
jq 'select(.annotations["authentication.k8s.io/stale-token"]) | {auditID, stageTimestamp, requestURI, verb, user: .user.username, stale_token: .annotations["authentication.k8s.io/stale-token"]}' /var/log/kube-audit/audit.log
If you do not see the necessary logs, set settings.apiserver.auditPolicyEnabled in control-plane-manager ModuleConfig according to the documentation and add an additional audit policy to log actions of all service accounts.
- level: Metadata omitStages: - RequestReceived userGroups: - system:serviceaccounts
Example of applying an additional audit policy:
d8 k apply -f - <<EOF apiVersion: v1 kind: Secret metadata: name: audit-policy namespace: kube-system data: audit-policy.yaml: YXBpVmVyc2lvbjogYXVkaXQuazhzLmlvL3YxCmtpbmQ6IFBvbGljeQpydWxlczoKLSBsZXZlbDogTWV0YWRhdGEKICBvbWl0U3RhZ2VzOgogIC0gUmVxdWVzdFJlY2VpdmVkCiAgdXNlckdyb3VwczoKICAtIHN5c3RlbTpzZXJ2aWNlYWNjb3VudHM= EOF
D8KubernetesVersionIsDeprecated CE S7

Kubernetes version VERSION_NUMBER is deprecated.

The current Kubernetes version VERSION_NUMBER has been deprecated, and support for it will be removed in upcoming releases.

Please migrate to the next kubernetes version (at least 1.30)

Read documentation about how to update the Kubernetes version in the cluster.

D8NeedDecreaseEtcdQuotaBackendBytes CE S6

Deckhouse suggests reducing quota-backend-bytes.

When the control plane node memory is reduced, Deckhouse may suggest reducing quota-backend-bytes. While Deckhouse is capable of automatically increasing this value, reducing it must be done manually.

To modify quota-backend-bytes, set the controlPlaneManager.etcd.maxDbSize parameter. Before setting a new value, check the current database usage on every control plane node by running:

for pod in $(d8 k get pod -n kube-system -l component=etcd,tier=control-plane -o name); do
  d8 k -n kube-system exec -ti "$pod" -- /usr/bin/etcdctl \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/ca.crt \
    --key /etc/kubernetes/pki/etcd/ca.key \
    endpoint status -w json | jq --arg a "$pod" -r \
    '.[0].Status.dbSize / 1024 / 1024 | tostring | $a + ": " + . + " MB"';
done

Things to note:

The maximum value for controlPlaneManager.etcd.maxDbSize is 8 GB.
If control plane nodes have less than 24 GB, set controlPlaneManager.etcd.maxDbSize to 2 GB.

Starting from 24 GB, quota-backend-bytes will increase by 1 GB for every extra 8 GB of memory.

Example:

Node memory	quota-backend-bytes
16 GB	2147483648 (2 GB)
24 GB	3221225472 (3 GB)
32 GB	4294967296 (4 GB)
40 GB	5368709120 (5 GB)
48 GB	6442450944 (6 GB)
56 GB	7516192768 (7 GB)
64 GB	8589934592 (8 GB)
72 GB	8589934592 (8 GB)
…	…

Модуль documentation

ModuleConfigDeprecated CE S9
Deprecated deckhouse-web ModuleConfig detected.
The deckhouse-web module has been renamed to documentation, and a new documentation ModuleConfig is generated automatically.

Steps to troubleshoot:
1. Remove the deprecated deckhouse-web ModuleConfig from the CI deployment process.
2. Delete it using the following command:
  
  d8 k delete mc deckhouse-web

Модуль extended-monitoring

CertificateSecretExpired CE S8
Certificate has expired.
A certificate in Secret SECRET_NAMESPACE/SECRET_NAME has expired.

Ways to resolve:
- If the certificate is managed manually, upload a new certificate.
- If the certificate is managed by the cert-manager module, inspect the certificate resource:
  
  Retrieve the certificate name from the Secret:
  
  cert=$(d8 k get secret -n SECRET_NAMESPACE SECRET_NAME -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
  
  Check the certificate status and investigate why it hasn’t been updated:
  
  d8 k describe cert -m SECRET_NAMESPACE "$cert"
CertificateSecretExpiredSoon CE S8
Certificate is expiring soon.
A certificate in Secret SECRET_NAMESPACE/SECRET_NAME will expire in less than two weeks.

Ways to resolve:
- If the certificate is managed manually, upload a new certificate.
- If the certificate is managed by the cert-manager module, inspect the certificate resource:
  
  Retrieve the certificate name from the Secret:
  
  cert=$(d8 k get secret -n SECRET_NAMESPACE SECRET_NAME -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
  
  Check the certificate status and investigate why it hasn’t been updated:
  
  d8 k describe cert -n SECRET_NAMESPACE "$cert"
CronJobAuthenticationFailure CE S7
Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.
Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
CronJobAuthorizationFailure CE S7
Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.
Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
CronJobBadImageFormat CE S7
The IMAGE_NAME image name is incorrect.
Deckhouse has detected that the IMAGE_NAME image name is incorrect.

To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
CronJobFailed CE S5
Job NAMESPACE/JOB_NAME failed in CronJob NAMESPACE/CRONJOB.
Deckhouse has detected that Job NAMESPACE/JOB_NAME failed in CronJob NAMESPACE/CRONJOB.

Steps to resolve:
1. Print the job details:
  
  d8 k -n NAMESPACE describe job JOB_NAME
2. Check the job status:
  
  d8 k -n NAMESPACE get job JOB_NAME
3. Check the status of pods created by the job:
  
  d8 k -n NAMESPACE get pods -l job-name=JOB_NAME
CronJobImageAbsent CE S7
The IMAGE_NAME image is missing from the registry.
Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
CronJobPodsNotCreated CE S5
Pods set in CronJob NAMESPACE/JOB_NAME haven't been created.
Deckhouse has detected that the pods set in CronJob NAMESPACE/CRONJOB still haven’t been created.

Steps to resolve:
1. Print the job details:
  
  d8 k -n NAMESPACE describe job JOB_NAME
2. Check the job status:
  
  d8 k -n NAMESPACE get job JOB_NAME
3. Check the status of pods created by the job:
  
  d8 k -n NAMESPACE get pods -l job-name=JOB_NAME
CronJobRegistryUnavailable CE S7
The container registry is not available for the IMAGE_NAME image.
Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
CronJobSchedulingError CE S6
CronJob NAMESPACE/CRONJOB failed to schedule on time.
Deckhouse has detected that CronJob NAMESPACE/CRONJOB failed to schedule on time.
- Current schedule: XXX
- Last scheduled time: XXX%
- Next projected schedule time: XXX%
CronJobUnknownError CE S7
An unknown error occurred with the IMAGE_NAME image.
Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:
- The NAMESPACE namespace.
- The CronJob NAME.
- The CONTAINER_NAME container in the registry.
To resolve this issue, review the exporter logs:
d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
D8CertExporterPodIsNotReady CE S8
The x509-certificate-exporter pod isn't ready.
Steps to resolve:
1. Retrieve the deployment details:
  
  d8 k -n d8-monitoring describe deploy x509-certificate-exporter
2. Check the pod status and investigate why it’s not ready:
  
  d8 k -n d8-monitoring describe pod -l app=x509-certificate-exporter
D8CertExporterPodIsNotRunning CE S8
The x509-certificate-exporter pod isn't running.
Steps to resolve:
1. Retrieve the deployment details:
  
  d8 k -n d8-monitoring describe deploy x509-certificate-exporter
2. Check the pod status and investigate why it’s not running:
  
  d8 k -n d8-monitoring describe pod -l app=x509-certificate-exporter

D8CertExporterTargetAbsent CE S8

There is no x509-certificate-exporter target in Prometheus.

Ways to resolve:

Check the pod status:

d8 k -n d8-monitoring get pod -l app=x509-certificate-exporter

Check the pod logs:

d8 k -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter

D8CertExporterTargetDown CE S8

Prometheus can't scrape x509-certificate-exporter metrics.

Ways to resolve:

Check the pod status:

d8 k -n d8-monitoring get pod -l app=x509-certificate-exporter

Check the pod logs:

d8 k -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter

D8ImageAvailabilityExporterMalfunctioning CE S8
The image-availability-exporter has crashed.
The image-availability-exporter has failed to perform any image availability checks in the container registry for over 20 minutes.

To investigate the issue, review the exporter’s logs:
d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
D8ImageAvailabilityExporterPodIsNotReady CE S8
The image-availability-exporter pod is not ready.
Deckhouse has detected that the image-availability-exporter pod is not ready. As a result, the images listed in the image field aren’t checked for availability in the container registry.

Steps to resolve:
1. Retrieve the deployment details:
  
  d8 k -n d8-monitoring describe deploy image-availability-exporter
2. Check the pod status and investigate why it isn’t Ready:
  
  d8 k -n d8-monitoring describe pod -l app=image-availability-exporter
D8ImageAvailabilityExporterPodIsNotRunning CE S8
The image-availability-exporter pod is not running.
Deckhouse has detected that the image-availability-exporter pod is not running. As a result, the images listed in the image field aren’t checked for availability in the container registry.

Steps to resolve:
1. Retrieve the deployment details:
  
  d8 k -n d8-monitoring describe deploy image-availability-exporter
2. Check the pod status and investigate why it isn’t running:
  
  d8 k -n d8-monitoring describe pod -l app=image-availability-exporter
D8ImageAvailabilityExporterTargetAbsent CE S8
The image-availability-exporter target is missing from Prometheus.
Deckhouse has detected that the image-availability-exporter target is missing from Prometheus.

Steps to resolve:
1. Check the pod status:
  
  d8 k -n d8-monitoring get pod -l app=image-availability-exporter
2. Check the pod logs:
  
  d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
D8ImageAvailabilityExporterTargetDown CE S8
Prometheus can't scrape metrics of image-availability-exporter.
Deckhouse has detected that Prometheus is unable to scrape metrics of image-availability-exporter.

Steps to resolve:
1. Check the pod status:
  
  d8 k -n d8-monitoring get pod -l app=image-availability-exporter
2. Check the pod logs:
  
  d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
DaemonSetAuthenticationFailure CE S7
Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.
Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
DaemonSetAuthorizationFailure CE S7
Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.
Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
DaemonSetBadImageFormat CE S7
The IMAGE_NAME image name is incorrect.
Deckhouse has detected that the IMAGE_NAME image name is incorrect.

To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
DaemonSetImageAbsent CE S7
The IMAGE_NAME image is missing from the registry.
Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
DaemonSetRegistryUnavailable CE S7
The container registry is not available for the IMAGE_NAME image.
Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
DaemonSetUnknownError CE S7
An unknown error occurred with the IMAGE_NAME image.
Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:
- The NAMESPACE namespace.
- The DaemonSet NAME.
- The CONTAINER_NAME container in the registry.
To resolve this issue, review the exporter logs:
d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
DeploymentAuthenticationFailure CE S7
Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.
Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
DeploymentAuthorizationFailure CE S7
Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.
Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
DeploymentBadImageFormat CE S7
The IMAGE_NAME image name is incorrect.
Deckhouse has detected that the IMAGE_NAME image name is incorrect.

To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
DeploymentImageAbsent CE S7
The IMAGE_NAME image is missing from the registry.
Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
DeploymentRegistryUnavailable CE S7
The container registry is not available for the IMAGE_NAME image.
Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
DeploymentUnknownError CE S7
An unknown error occurred with the IMAGE_NAME image.
Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:
- The NAMESPACE namespace.
- The Deployment NAME.
- The CONTAINER_NAME container in the registry.
To resolve this issue, review the exporter logs:
d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
ExtendedMonitoringTargetDown CE S5
Extended monitoring is unavailable.
The pod running extended-monitoring-exporter is currently unavailable.

As a result, the following alerts will not be triggered:
- Low disk space and inode usage on volumes.
- CPU overloads and container throttling.
- 500 errors on Ingress.
- Insufficient replicas of Deployments, StatefulSets, and DaemonSets.
- Other alerts associated with this exporter.
To resolve this issue, investigate its possible causes:
1. Print detailed information about the extended-monitoring-exporter deployment:
  
  d8 k -n d8-monitoring describe deploy extended-monitoring-exporter
2. Print detailed information about the pods associated with the extended-monitoring-exporter:
  
  d8 k -n d8-monitoring describe pod -l app=extended-monitoring-exporter
IngressResponses5xx CE S4

URL VHOST/LOCATION on Ingress INGRESS has more than XXX% of 5xx responses from the backend.

Deckhouse has detected that URL VHOST/LOCATION on Ingress INGRESS, using service SERVICE_NAME on port SERVICE_PORT has more than XXX% of 5xx responses from the backend.

Current rate of 5xx responses: VALUE%
IngressResponses5xx CE S5

URL VHOST/LOCATION on Ingress INGRESS has more than XXX% of 5xx responses from the backend.

Deckhouse has detected that URL VHOST/LOCATION on Ingress INGRESS, using service SERVICE_NAME on port SERVICE_PORT, has more than XXX% of 5xx responses from the backend.

Current rate of 5xx responses: VALUE%
KubernetesDaemonSetNotUpToDate CE S9
There were VALUE outdated pods in DaemonSet NAMESPACE/DAEMONSET_NAME over the last 15 minutes.
Deckhouse has detected VALUE outdated pods in DaemonSet NAMESPACE/DAEMONSET_NAME over the last 15 minutes.

Steps to resolve:
1. Check the DaemonSet’s status:
  
  d8 k -n NAMESPACE get ds DAEMONSET_NAME
2. Analyze the DaemonSet’s description:
  
  d8 k -n NAMESPACE describe ds DAEMONSET_NAME
3. If the parameter Number of Nodes Scheduled with Up-to-date Pods does not match Current Number of Nodes Scheduled, check the DaemonSet’s updateStrategy:
  
  d8 k -n NAMESPACE get ds DAEMONSET_NAME -o json | jq '.spec.updateStrategy'
  
  If updateStrategy is set to OnDelete, the DaemonSet is updated only when pods are deleted.

KubernetesDaemonSetReplicasUnavailable CE S5

No available replicas remaining in DaemonSet NAMESPACE/DAEMONSET_NAME.

Deckhouse has detected that there are no available replicas remaining in DaemonSet NAMESPACE/DAEMONSET_NAME.

List of unavailable pods:

XXXXXX, XXXPOD_NAMEXXX

If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.

d8 k -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'

KubernetesDaemonSetReplicasUnavailable CE S6
The number of unavailable replicas in DaemonSet NAMESPACE/DAEMONSET_NAME exceeds the threshold.
Deckhouse has detected that the number of unavailable replicas in DaemonSet NAMESPACE/DAEMONSET_NAME exceeds the threshold.
- Current number: VALUE unavailable replica(s).
- Threshold number: XXX unavailable replica(s).
List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.
d8 k -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
KubernetesDeploymentReplicasUnavailable CE S5
No available replicas remaining in deployment NAMESPACE/DEPLOYMENT_NAME.
Deckhouse has detected that there are no available replicas remaining in deployment NAMESPACE/DEPLOYMENT_NAME.

List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
KubernetesDeploymentReplicasUnavailable CE S6
The number of unavailable replicas in deployment NAMESPACE/DEPLOYMENT_NAME exceeds spec.strategy.rollingupdate.maxunavailable.
Deckhouse has detected that the number of unavailable replicas in deployment NAMESPACE/DEPLOYMENT_NAME exceeds the value set in spec.strategy.rollingupdate.maxunavailable.
- Current number: VALUE unavailable replica(s).
- Threshold number: XXX unavailable replica(s).
List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
KubernetesStatefulSetReplicasUnavailable CE S5
No ready replicas remaining in StatefulSet NAMESPACE/STATEFULSET.
Deckhouse has detected that there are no ready replicas remaining in StatefulSet NAMESPACE/STATEFULSET.

List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
KubernetesStatefulSetReplicasUnavailable CE S6
The number of unavailable replicas in StatefulSet NAMESPACE/STATEFULSET exceeds the threshold.
Deckhouse has detected that the number of unavailable replicas in StatefulSet NAMESPACE/STATEFULSET exceeds the threshold.
- Current number: VALUE unavailable replica(s).
- Threshold number: XXX unavailable replica(s).
List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
LoadAverageHigh CE S4
Average load on node NODE_NAME is too high.
Over the last 5 minutes, the average load on node NODE_NAME has been higher than XXX per core.

There are more processes in the queue than the CPU can handle.

Possible causes:
- A process has created too many threads or child processes.
- The CPU is overloaded.
LoadAverageHigh CE S5
Average load on node NODE_NAME is too high.
Over the last 30 minutes, the average load on node NODE_NAME has been higher than or equal to XXX per core.

There are more processes in the queue than the CPU can handle.

Possible causes:
- A process has created too many threads or child processes.
- The CPU is overloaded.
NodeDiskBytesUsage CE S5
Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.
Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Current storage usage: VALUE%

Steps to resolve:
1. Retrieve disk usage information on the node:
  
  ncdu -x MOUNTPOINT
2. If the output shows high disk usage in the /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ directory, identify the pods with the highest usage:
  
  crictl stats -o json | jq '.stats[] | select((.writableLayer.usedBytes.value | tonumber) > 1073741824) | { meta: .attributes.labels, diskUsage: ((.writableLayer.usedBytes.value | tonumber) / 1073741824 * 100 | round / 100 | tostring + " GiB")}'
NodeDiskBytesUsage CE S6
Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.
Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Current storage usage: VALUE%

Steps to resolve:
1. Retrieve disk usage information on the node:
  
  ncdu -x MOUNTPOINT
2. If the output shows high disk usage in the /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ directory, identify the pods with the highest usage:
  
  crictl stats -o json | jq '.stats[] | select((.writableLayer.usedBytes.value | tonumber) > 1073741824) | { meta: .attributes.labels, diskUsage: ((.writableLayer.usedBytes.value | tonumber) / 1073741824 * 100 | round / 100 | tostring + " GiB")}'
NodeDiskInodesUsage CE S5

Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Current storage usage: VALUE%
NodeDiskInodesUsage CE S6

Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

Current storage usage: VALUE%
PersistentVolumeClaimBytesUsage CE S4
PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.
Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

Current volume storage usage: VALUE%

PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:
XXXXXX, XXXPOD_NAMEXXX
PersistentVolumeClaimBytesUsage CE S5
PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.
Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

Currently volume storage usage: VALUE%

PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:
XXXXXX, XXXPOD_NAMEXXX
PersistentVolumeClaimInodesUsed CE S4
PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.
Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

Current volume inode usage: VALUE%

PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:
XXXXXX, XXXPOD_NAMEXXX
PersistentVolumeClaimInodesUsed CE S5
PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.
Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

Current volume inode usage: VALUE%

PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:
XXXXXX, XXXPOD_NAMEXXX
StatefulSetAuthenticationFailure CE S7
Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.
Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
StatefulSetAuthorizationFailure CE S7
Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.
Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
StatefulSetBadImageFormat CE S7
The IMAGE_NAME image name is incorrect.
Deckhouse has detected that the IMAGE_NAME image name is incorrect.

To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
StatefulSetImageAbsent CE S7
The IMAGE_NAME image is missing from the registry.
Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
StatefulSetRegistryUnavailable CE S7
The container registry is not available for the IMAGE_NAME image.
Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

To resolve this issue, investigate the possible causes in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
StatefulSetUnknownError CE S7
An unknown error occurred with the IMAGE_NAME image.
Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:
- The NAMESPACE namespace.
- The StatefulSet NAME.
- The CONTAINER_NAME container in the registry.
To resolve this issue, review the exporter logs:
d8 k -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter

Модуль ingress-nginx

D8NginxIngressKruiseControllerPodIsRestartingTooOften CE S8
Too many Kruise controller restarts detected.
VALUE Kruise controller restarts detected in the last hour.

Excessive Kruise controller restarts indicate that something is wrong. Normally, it should be up and running all the time.

Steps to resolve:
1. Check events associated with kruise-controller-manager in the d8-ingress-nginx namespace. Look for issues related to node failures or memory shortages (OOM events):
  
  d8 k -n d8-ingress-nginx get events | grep kruise-controller-manager
2. Analyze the controller’s pod descriptions to identify restarted containers and possible causes. Pay attention to exit codes and other details:
  
  d8 k -n d8-ingress-nginx describe pod -lapp=kruise,control-plane=controller-manager
3. In case the kruise container has restarted, get a list of relevant container logs to identify any meaningful errors:
  
  d8 k -n d8-ingress-nginx logs -lapp=kruise,control-plane=controller-manager -c kruise

DeprecatedGeoIPVersion CE S9

Deprecated GeoIP version 1 is used in the cluster.

An IngressNginxController and/or Ingress object in the cluster is using variables from the deprecated NGINX GeoIPv1 module. Support for this module has been discontinued in Ingress NGINX Controller version 1.10 and higher.

It’s recommended that you update your configuration to use the GeoIPv2 module.

To get a list of the IngressNginxControllers using GeoIPv1 variables, run the following command:

d8 k get ingressnginxcontrollers.deckhouse.io -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | .metadata.name'

To get a list of the Ingress objects using GeoIPv1 variables, run the following command:

d8 k get ingress -A -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | "\(.metadata.namespace)/\(.metadata.name)"' | sort | uniq

GeoIPDownloadErrorDetected CE S4

GeoIP DB download error in controller CONTROLLER_NAME.

The controller CONTROLLER_NAME failed to download the GeoIP database.

Reason: XXX

Type: ERROR_TYPE

If this issue persists, investigate network connectivity or database availability.
NginxIngressConfigTestFailed CE S4
Configuration test failed on Ingress NGINX CONTROLLER_NAMESPACE/CONTROLLER_NAME.
The configuration test (nginx -t) for the CONTROLLER_NAME Ingress controller in the CONTROLLER_NAMESPACE namespace has failed.

Steps to resolve:
1. Check the controller logs:
  
  d8 k -n CONTROLLER_NAMESPACE logs CONTROLLER_POD_NAME -c controller
2. Find the most recently created Ingress in the cluster:
  
  d8 k get ingress --all-namespaces --sort-by="metadata.creationTimestamp"
3. Check for errors in the configuration-snippet or server-snippet annotations.
NginxIngressDaemonSetNotUpToDate CE S9
There were VALUE outdated pods in Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME over the last 20 minutes.
Deckhouse has detected VALUE outdated pods in Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME over the last 20 minutes.

Steps to resolve:
1. Check the DaemonSet’s status:
  
  d8 k -n NAMESPACE get ads DAEMONSET_NAME
2. Analyze the DaemonSet’s description:
  
  d8 k -n NAMESPACE describe ads DAEMONSET_NAME
3. If the parameter Number of Nodes Scheduled with Up-to-date Pods does not match Current Number of Nodes Scheduled, check the ‘nodeSelector’ and ‘toleration’ settings of the corresponding Ingress NGINX Controller and compare them to the ‘labels’ and ‘taints’ settings of the relevant nodes.
NginxIngressDaemonSetReplicasUnavailable CE S4
No available replicas remaining in Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME.
Deckhouse has detected that there are no available replicas remaining in Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME.

List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.
d8 k -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
NginxIngressDaemonSetReplicasUnavailable CE S6
Some replicas of Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME are unavailable.
Deckhouse has detected that some replicas of Ingress NGINX DaemonSet NAMESPACE/DAEMONSET_NAME are unavailable.

Current number: VALUE unavailable replica(s).

List of unavailable pods:
XXXXXX, XXXPOD_NAMEXXX
If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.
d8 k -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
NginxIngressPodIsRestartingTooOften CE S4

Too many Ingress NGINX restarts detected.

VALUE Ingress NGINX Controller restarts detected in the last hour.

Excessive Ingress NGINX restarts indicate that something is wrong. Normally, it should be up and running all the time.
NginxIngressProfilingIsEnabled CE S4

Warning: Profiling mode is enabled in the Ingress NGINX Controller "XXX".

Profiling mode is enabled for the Ingress NGINX Controller “XXX”. This may increase memory consumption, slow down request processing, and the process may run as root. It is recommended to disable profiling unless it is actively needed for debugging or performance analysis. To disable profiling, set nginxProfilingEnabled: false in the ingressnginxcontroller resource configuration for this controller.
NginxIngressProtobufExporterHasErrors CE S8
The Ingress NGINX sidecar container with protobuf_exporter has ERROR_TYPE errors.
Deckhouse has detected that the Ingress NGINX sidecar container with protobuf_exporter has ERROR_TYPE errors.

To resolve the issue, check the Ingress controller’s logs:
d8 k -n d8-ingress-nginx logs $(d8 k -n d8-ingress-nginx get pods -l app=controller,name=CONTROLLER_NAME -o wide | grep NODE_NAME | awk '{print $1}') -c protobuf-exporter
NginxIngressSslExpired CE S4
Certificate has expired.
The SSL certificate for HOST_NAME in the NAMESPACE namespace has expired.

To verify the certificate, run the following command:
d8 k -n NAMESPACE get secret SECRET_NAME -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
The site at https://HOST_NAME is not accessible.
NginxIngressSslWillExpire CE S5
Certificate is expiring soon.
The SSL certificate for HOST_NAME in the NAMESPACE namespace will expire in less than two weeks.

To verify the certificate, run the following command:
d8 k -n NAMESPACE get secret SECRET_NAME -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
NginxIngressValidationIsDisabled CE S4
Warning: Ingress resource validation in the Ingress NGINX Controller is currently disabled.
Validation is disabled to reduce load on the master nodes, as it requires additional resources. To re-enable validation, remove the annotation network.deckhouse.io/ingress-nginx-validation-suspended from the ingressnginxcontroller resource.

To find which IngressNginxController resources have the annotation, use the following command:
d8 k get ingressnginxcontrollers.deckhouse.io -o json | jq -r '.items[] | select(.metadata.annotations."network.deckhouse.io/ingress-nginx-validation-suspended" != null) | "\(.metadata.name)"'

Модуль istio

D8IstioActualDataPlaneVersionNotEqualDesired EE S8
There are pods with Istio data plane version VERSION_NUMBER, but desired version is XXX.
There are pods in the NAMESPACE namespace with Istio data plane version VERSION_NUMBER, while the desired version is XXX. As a result, the Istio version will be changed after the pod is restarted.

To resolve the issue, use the following cheat sheet:
### Namespace-wide configuration # `istio.io/rev=vXYZ`: Use a specific revision. # `istio-injection=enabled`: Use the global revision. d8 k get ns NAMESPACE --show-labels ### Pod-wide configuration d8 k -n NAMESPACE get pods -l istio.io/rev=DESIRED_VISION
D8IstioActualVersionIsNotInstalled EE S4
The control plane version for pods with injected sidecars isn't installed.
There are pods in the NAMESPACE namespace with injected sidecars of version VERSION_NUMBER (revision REVISION_NUMBER), but the corresponding control plane version is not installed. As a result, these pods have lost synchronization with the state in Kubernetes.

To resolve this issue, install the required control plane version. Alternatively, update the namespace or pod configuration to match an installed control plane version.

To identify orphaned pods, run the following command:
d8 k -n NAMESPACE get pods -l 'service.istio.io/canonical-name' -o json | jq --arg revision REVISION_NUMBER '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
D8IstioAdditionalControlplaneDoesntWork CE S4
Additional control plane isn't working.
Deckhouse has detected that the additional Istio control plane ISTIO_REVISION_LABEL isn’t working.

As a result, sidecar injection for pods with ISTIO_REVISION_LABEL isn’t working as well.

To check the status of the control plane pods, run the following command:
d8 k get pods -n d8-istio -l istio.io/rev=ISTIO_REVISION_LABEL
D8IstioDataPlaneVersionMismatch EE S8
There are pods with data plane version different from the control plane version.
There are pods in the NAMESPACE namespace with Istio data plane version VERSION_NUMBER, which is different from the control plane version DESIRED_VERSION.

Steps to resolve the issue:
1. Restart affected pods and use the following PromQL query to get a full list:
  
  max by (namespace, dataplane_pod) (d8_istio_dataplane_metadata{full_version="VERSION_NUMBER"})
2. Use the automatic Istio data plane upgrade described in the guide.
D8IstioDataPlaneWithoutIstioInjectionConfigured EE S4
Detected pods with Istio sidecars but istio-injection isn't configured.
There are pods in the NAMESPACE namespace with Istio sidecars, but istio-injection isn’t configured. As a result, these pods will lose their Istio sidecars after being recreated.

To identify the affected pods, run the following command:
d8 k -n NAMESPACE get pods -o json | jq -r --arg revision REVISION_NUMBER '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
D8IstioDeprecatedIstioVersionInstalled CE

The installed Istio version has been deprecated.

Deckhouse has detected that a deprecated Istio version VERSION_NUMBER is installed.

Support for this version will be removed in upcoming Deckhouse releases. The higher the alert severity, the greater the probability of support being discontinued.

To learn how to upgrade Istio, refer to the upgrade guide.
D8IstioDesiredVersionIsNotInstalled EE S6
Desired control plane version isn't installed.
There is a desired Istio control plane version XXX (revision REVISION_NUMBER) configured for pods in the NAMESPACE namespace, but that version isn’t installed. As a result, pods can’t be recreated in the NAMESPACE namespace.

To resolve this issue, install the desired control plane version. Alternatively, update the namespace or pod configuration to match an installed control plane version.

Use the following cheat sheet:
### Namespace-wide configuration # `istio.io/rev=vXYZ`: Use a specific revision. # `istio-injection=enabled`: Use the global revision. d8 k get ns NAMESPACE --show-labels ### Pod-wide configuration d8 k -n NAMESPACE get pods -l istio.io/rev=REVISION_NUMBER

D8IstioFederationMetadataEndpointDoesntWork EE S6

Federation metadata endpoint failed.

The metadata endpoint ENDPOINT_NAME for IstioFederation FEDERATION_NAME has failed to fetch via the Deckhouse hook.

To reproduce the request to the public endpoint, run the following command:

curl ENDPOINT_NAME

To reproduce the request to private endpoints (run from the Deckhouse pod), run the following:

KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
REMOTE_CLUSTER_UUID="$(d8 k get istiofederation FEDERATION_NAME -o json | jq -r .status.metadataCache.public.clusterUUID)"
TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-federation --ttl 1h)"
curl -H "Authorization: Bearer $TOKEN" ENDPOINT_NAME

D8IstioGlobalControlplaneDoesntWork CE S4
Global control plane isn't working.
Deckhouse has detected that the global Istio control plane ISTIO_REVISION_LABEL isn’t working.

As a result, sidecar injection for pods with global revision isn’t working as well, and the validating webhook for Istio resources is absent.

To check the status of the control plane pods, run the following command:
d8 k get pods -n d8-istio -l istio.io/rev=ISTIO_REVISION_LABEL

D8IstioMulticlusterMetadataEndpointDoesntWork EE S6

Multicluster metadata endpoint failed.

The metadata endpoint ENDPOINT_NAME for IstioMulticluster MULTICLUSTER_NAME has failed to fetch via the Deckhouse hook.

To reproduce the request to the public endpoint, run the following command:

curl ENDPOINT_NAME

To reproduce the request to private endpoints (run from the d8-system/deckhouse pod), run the following:

KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
REMOTE_CLUSTER_UUID="$(d8 k get istiomulticluster MULTICLUSTER_NAME -o json | jq -r .status.metadataCache.public.clusterUUID)"
TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-multicluster --ttl 1h)"
curl -H "Authorization: Bearer $TOKEN" ENDPOINT_NAME

D8IstioMulticlusterRemoteAPIHostDoesntWork EE S6
Multicluster remote API host health check failed.
The remote API host API_HOST for IstioMulticluster MULTICLUSTER_NAME has failed the health check performed by the Deckhouse monitoring hook.

To reproduce the request (run from the d8-system/deckhouse pod), run the following:
TOKEN="$(deckhouse-controller module values istio -o json | jq -r --arg ah API_HOST '.internal.multiclusters[]| select(.apiHost == $ah)| .apiJWT ')" curl -H "Authorization: Bearer $TOKEN" https://API_HOST/version
D8IstioOperatorReconcileError CE S5
The istio-operator is unable to reconcile Istio control plane setup.
Deckhouse has detected an error in the istio-operator reconciliation loop.

To investigate the issue, check the operator logs:
d8 k -n d8-istio logs -l app=operator,revision=REVISION_NUMBER
D8IstioPodsWithoutIstioSidecar EE S4
Detected pods without Istio sidecars but with istio-injection configured.
There is a pod POD_NAME in the NAMESPACE namespace without Istio sidecars, but with istio-injection configured.

To identify the affected pods, run the following command:
d8 k -n NAMESPACE get pods -l '!service.istio.io/canonical-name' -o json | jq -r '.items[] | select(.metadata.annotations."sidecar.istio.io/inject" != "false") | .metadata.name'
D8IstioRemoteClusterNotSynced EE S4
Istio remote cluster CLUSTER_ID is not synchronized.
The Istio control plane instance INSTANCE_NAME cannot synchronize with the remote cluster CLUSTER_ID.

Possible causes:
- The remote cluster is offline.
- The remote API endpoint is not reachable.
- The remote ServiceAccount token is invalid or expired.
- There is a TLS or certificate issue between the clusters.
D8IstioVersionIsIncompatibleWithK8sVersion CE S3

The installed Istio version is incompatible with the Kubernetes version.

The installed Istio version VERSION_NUMBER may not work properly with the current Kubernetes version VERSION_NUMBER because it’s not supported officially.

To resolve the issue, upgrade Istio following the guide.
IstioIrrelevantExternalServiceFound CE S5

External service found with irrelevant ports specifications.

A service NAME in the NAMESPACE namespace has an irrelevant port specification.

The .spec.ports[] field isn’t applicable for services of the ExternalName type. However, Istio renders port listeners for external services as 0.0.0.0:port, which captures all traffic to the specified port. This can cause problems for services that aren’t registered in the Istio registry.

To resolve the issue, remove the .spec.ports section from the service configuration. It is safe.

Модуль kube-dns

KubernetesCoreDNSHasCriticalErrors CE S5
Critical errors found in CoreDNS.
Deckhouse has detected at least one critical error in the CoreDNS pod POD_NAME.

To resolve the issue, review the container logs:
d8 k -n kube-system logs POD_NAME

Модуль log-shipper

D8LogShipperAgentNotScheduledInCluster CE S7

The log-shipper-agent pods can't be scheduled in the cluster.

Deckhouse has detected that a number of log-shipper-agent pods are not scheduled.

To resolve this issue, do the following:

Check the state of the d8-log-shipper/log-shipper-agent DaemonSet:

d8 k -n d8-log-shipper get daemonsets --selector=app=log-shipper

Check the state of the d8-log-shipper/log-shipper-agent pods:

d8 k -n d8-log-shipper get pods --selector=app=log-shipper-agent

If you know where the DaemonSet should be scheduled, run the following command to identify the problematic nodes:

d8 k -n d8-log-shipper get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="log-shipper-agent")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'

D8LogShipperClusterLogDestinationD8LokiAuthorizationRequired CE S9

Authorization parameters required for the ClusterLogDestination resource.

Deckhouse has detected the ClusterLogDestination resource RESOURCE_NAME without authorization parameters.

Add the authorization parameters to the ClusterLogDestination resource following the instructions.
D8LogShipperCollectLogErrors CE S4
The log-shipper-agent pods can't collect logs to COMPONENT_ID on the NODE_NAME node.
Deckhouse has detected that the HOST_NAME log-shipper-agent on the NODE_NAME node has failed to collect metrics for more than 10 minutes.

This is caused by the ERROR_TYPE errors occurred during the STAGE_NAME stage while reading COMPONENT_TYPE.

To resolve this, check the pod logs or follow advanced instructions:
d8 k -n d8-log-shipper logs HOST_NAME` -c vector
D8LogShipperDestinationErrors CE S4
The log-shipper-agent pods can't send logs to COMPONENT_ID on the NODE_NAME node.
Deckhouse has detected that the HOST_NAME log-shipper-agent on the NODE_NAME node has failed to send a log for more than 10 minutes.

This is caused by the ERROR_TYPE errors occurred during the STAGE_NAME stage while sending logs to COMPONENT_TYPE.

To resolve this, check the pod logs or follow advanced instructions:
d8 k -n d8-log-shipper logs HOST_NAME -c vector
D8LogShipperLogsDroppedByRateLimit CE S4
The log-shipper-agent pods are dropping logs to COMPONENT_ID on the NODE_NAME node.
Rate-limiting rules have been applied, and the log-shipper-agent on the NODE_NAME node has been dropping logs for more than 10 minutes.

To resolve this, check the pod logs or follow advanced instructions:
d8 k -n d8-log-shipper get pods -o wide | grep NODE_NAME

Модуль loki

LokiInsufficientDiskForRetention CE S4

Not enough disk space to retain logs for 168 hours

Not enough disk space to retain logs for 168 hours. Current effective retention period is VALUE hours.

You need either decrease expected retentionPeriodHours or increase resize Loki PersistentVolumeClaim

Модуль metallb

D8MetallbBGPSessionDown EE S4
MetalLB BGP session is down.
JOB_NAME, MetalLB CONTAINER_NAME on POD_NAME has BGP session PEER down.

Check the logs for details:
d8 k -n d8-metallb logs daemonset/speaker -c speaker
D8MetallbConfigNotLoaded EE S4
The MetalLB configuration hasn't been loaded.
JOB_NAME, MetalLB CONTAINER_NAME on POD_NAME hasn’t been loaded.

To find the cause of the issue, review the controller logs:
d8 k -n d8-metallb logs deploy/controller -c controller
D8MetallbConfigStale EE S4
MetalLB is running on a stale configuration.
JOB_NAME, MetalLB CONTAINER_NAME on POD_NAME is running on a stale configuration because the latest configuration failed to load.

To find the cause of the issue, review the controller logs:
d8 k -n d8-metallb logs deploy/controller -c controller
D8MetallbNotSupportedServiceAnnotationsDetected SE S4
The annotation 'ANNOTATION_NAME' has been deprecated for the service 'NAME' in the 'NAMESPACE' namespace.
The annotation ‘ANNOTATION_NAME’ has been deprecated for the service ‘NAME’ in the ‘NAMESPACE’ namespace.

The following service annotations are no longer effective:
- metallb.universe.tf/ip-allocated-from-pool: Remove this annotation.
- metallb.universe.tf/address-pool: Replace it with the .spec.loadBalancerClass parameter or use the network.deckhouse.io/metal-load-balancer-class annotation, referencing the appropriate MetalLoadBalancerClass.
- metallb.universe.tf/loadBalancerIPs: Replace it with network.deckhouse.io/load-balancer-ips: IP.
- metallb.universe.tf/allow-shared-ip: Replace it with network.deckhouse.io/load-balancer-shared-ip-key.
Please note. Existing LoadBalancer services of Deckhouse have been migrated automatically, but the new ones will not be.
D8MetallbObsoleteLayer2PoolsAreUsed SE S7

The metallb module has obsolete layer2 pools configured.

In ModuleConfig version 2, addressPool ‘NAME’ of type “layer2” are ignored. They should be removed from the configuration.
D8MetallbUpdateMCVersionRequired SE S5

The metallb ModuleConfig settings are outdated.

D8 MetalLB settings are outdated.

To resolve this issue, increase version in the ModuleConfig metallb.
L2LoadBalancerOrphanServiceFound SE S4

Orphaned service with an irrelevant L2LoadBalancer name has been found.

The cluster contains an orphaned service NAME in the NAMESPACE namespace with an irrelevant L2LoadBalancer name.

To resolve this issue, verify the L2LoadBalancer name specified in the annotation network.deckhouse.io/l2-load-balancer-name.

Модуль monitoring-applications

D8OldPrometheusTargetFormat FE S6
Services with the prometheus-target label are being used for metric collection.
Deckhouse has detected that services with the prometheus-target label are being used to collect metrics in the cluster.

The label format has been changed. To resolve the issue, replace the prometheus-target label with prometheus.deckhouse.io/target.

To list all services labeled with prometheus-target, run the following command:
d8 k get service --all-namespaces --show-labels | grep prometheus-target

Модуль monitoring-custom

CustomPodMonitorFoundInCluster CE S9
Deckhouse namespace contains PodMonitors not created by Deckhouse.
There are PodMonitors in the Deckhouse namespace that were not created by Deckhouse.

To resolve the issue, move these PodMonitors to the user-spec namespace by removing the heritage: deckhouse label.

To list all PodMonitors in the Deckhouse namespace, run the following command:
d8 k get podmonitors --all-namespaces -l heritage!=deckhouse
For more information on metric collection, refer to the Prometheus module FAQ.
CustomServiceMonitorFoundInD8Namespace CE S9
Deckhouse namespace contains ServiceMonitors not created by Deckhouse.
There are ServiceMonitors in the Deckhouse namespace that were not created by Deckhouse.

To resolve the issue, move these ServiceMonitors to the user-spec namespace by removing the heritage: deckhouse label.

To list all ServiceMonitors in the Deckhouse namespace, run the following command:
d8 k get servicemonitors --all-namespaces -l heritage!=deckhouse
For more information on metric collection, refer to the Prometheus module FAQ.
D8CustomPrometheusRuleFoundInCluster CE S9
Deckhouse namespace contains PrometheusRules not created by Deckhouse.
There are PrometheusRules in the Deckhouse namespace that were not created by Deckhouse.

To resolve the issue, replace these PrometheusRules with the CustomPrometheusRules object.

To list all PrometheusRules in the Deckhouse namespace, run the following command:
d8 k get prometheusrules --all-namespaces -l heritage!=deckhouse
For details on adding alerts and recording rules, refer to the Prometheus module FAQ.
D8OldPrometheusCustomTargetFormat CE S9
Services with the prometheus-custom-target label are being used for metric collection.
Deckhouse has detected that services with the prometheus-custom-target label are being used to collect metrics in the cluster.

The label format has been changed. To resolve the issue, replace the prometheus-custom-target label with prometheus.deckhouse.io/custom-target.

To list all services labeled with prometheus-custom-target, run the following command:
d8 k get service --all-namespaces --show-labels | grep prometheus-custom-target
For more information on metric collection, refer to the Prometheus module FAQ.
D8ReservedNodeLabelOrTaintFound CE S6
Node NAME is using a reserved label or taint.
Deckhouse has detected that node NAME is using one of the following:
- A reserved metadata.labels object node-role.deckhouse.io/, which doesn’t end with (system|frontend|monitoring|_deckhouse_module_name_).
- A reserved spec.taints object dedicated.deckhouse.io, with a value other than (system|frontend|monitoring|_deckhouse_module_name_).
For instructions on how to resolve this issue, refer to the node allocation guide.

Модуль monitoring-deckhouse

D8CNIEnabledMoreThanOne CE S2

More than one CNI is enabled in the cluster.

Deckhouse has detected that multiple CNIs are enabled in the cluster. For the cluster to work correctly, only one CNI must be enabled.

To resolve this issue, disable any unnecessary CNI.
D8CNIMisconfigured CE S5
The parameters specified in ModuleConfig of CNI_NAME do not match the ones that are actually being used in the cluster.
This happened because there were several sources for configuring CNI parameters in the cluster, and ModuleConfig did not have the highest priority previously.

To resolve this issue, the following steps should be taken:
1. Find the ConfigMap d8-system/desired-cni-moduleconfig in the cluster, which contains the actual ModuleConfig settings.
  
  d8 k -n d8-system get configmap desired-cni-moduleconfig -o yaml
2. Apply this prepared ModuleConfig to the cluster. This will not cause any actual reconfigurations in the cluster and is completely safe.
3. Once the parameters in “ModuleConfig” match those used in the cluster:
  
  this alert will be resolved,
  
  and the priority of the sources for configuring CNI will change, and ModuleConfig will become the main source of truth.
4. After that, you can add the desired parameters to the ModuleConfig CNI and they will be applied immediately.
D8DeckhouseConfigInvalid CE S5
Deckhouse configuration is invalid.
Deckhouse configuration contains errors.

Steps to troubleshoot:
1. Check Deckhouse logs by running the following command:
  
  d8 k -n d8-system logs -f -l app=deckhouse
2. Edit the Deckhouse global configuration:
  
  d8 k edit mc global
  
  Or edit configuration of a specific module:
  
  d8 k edit mc MODULE_NAME
D8DeckhouseCouldNotDeleteModule CE S4
Deckhouse is unable to delete the MODULE_NAME module.
To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseCouldNotDiscoverModules CE S4
Deckhouse is unable to discover modules.
To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseCouldNotRunGlobalHook CE S5
Deckhouse is unable to run the HOOK_NAME global hook.
To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseCouldNotRunModule CE S4
Deckhouse is unable to start the MODULE_NAME module.
To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseCouldNotRunModuleHook CE S7
Deckhouse is unable to run the MODULE_NAME/HOOK_NAME module hook.
To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseCustomTargetDown CE S4

Prometheus is unable to scrape custom metrics generated by Deckhouse hooks.
D8DeckhouseDeprecatedConfigmapManagedByArgoCD CE S4

Deprecated Deckhouse ConfigMap managed by Argo CD.

The Deckhouse ConfigMap is no longer used.

To resolve this issue, remove the d8-system/deckhouse ConfigMap from Argo CD.
D8DeckhouseGlobalHookFailsTooOften CE S9
The HOOK_NAME Deckhouse global hook is crashing too frequently.
The HOOK_NAME hook has failed multiple times in the last __SCRAPE_INTERVAL_X_4__.

To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseHasNoAccessToRegistry CE S7

Deckhouse is unable to connect to the registry.

Deckhouse can’t connect to the registry (typically registry.deckhouse.io) to check for a new Docker image. These checks are performed every 15 seconds. Without access to the registry, automatic updates are unavailable.

This alert often indicates that the Deckhouse Pod is experiencing connectivity issues with the Internet.
D8DeckhouseIsHung CE S4

Deckhouse is down.

Deckhouse is probably down, since the deckhouse_live_ticks metric in Prometheus has stopped increasing. This metric is expected to increment every 10 seconds.
D8DeckhouseIsNotOnReleaseChannel CE S9
Deckhouse isn't subscribed to any regular release channels.
Deckhouse in this cluster isn’t subscribed to any of the regular release channels: Alpha, Beta, EarlyAccess, Stable, or RockSolid.

To resolve this issue, follow these steps:
1. Check the current release channel used in the cluster:
  
  d8 k -n d8-system get deploy deckhouse -o json | jq '.spec.template.spec.containers[0].image' -r
2. Subscribe to one of the regular release channels by adjusting the deckhouse module configuration.
D8DeckhouseModuleHookFailsTooOften CE S9
The MODULE_NAME/HOOK_NAME Deckhouse hook is crashing too frequently.
The HOOK_NAME hook of the MODULE_NAME module has failed multiple times in the last __SCRAPE_INTERVAL_X_4__.

To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseModuleUpdatePolicyNotFound CE S5
Module update policy not found for MODULE_RELEASE.
The module update policy for MODULE_RELEASE is missing.

To resolve this issue, remove the label from the module release using the following command:
d8 k label mr MODULE_RELEASE modules.deckhouse.io/update-policy-
A new suitable policy will be detected automatically.
D8DeckhouseModuleValidationError CE S5
Module configuration failed for module MODULE_NAME.
Initial config for module MODULE_NAME is not valid.

You can get more details via
d8 k get mr -l module=MODULE_NAME
Provided error: XXX
D8DeckhousePodIsNotReady CE S4

The Deckhouse Pod is NOT Ready.
D8DeckhousePodIsNotRunning CE S4

The Deckhouse Pod is NOT Running.
D8DeckhousePodIsRestartingTooOften CE S9
Excessive Deckhouse restarts detected.
Number of restarts in the last hour: VALUE.

Excessive Deckhouse restarts indicate a potential issue. Normally, Deckhouse should be up and running continuously.

To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseQueueIsHung CE S7
The QUEUE_NAME Deckhouse queue is stuck with VALUE pending task(s).
Deckhouse cannot finish processing of the QUEUE_NAME queue, which currently has VALUE pending task(s).

To investigate the issue, check the logs by running the following command:
d8 k -n d8-system logs -f -l app=deckhouse
D8DeckhouseSelfTargetAbsent CE S4

There is no Deckhouse target in Prometheus.
D8DeckhouseSelfTargetDown CE S4

Prometheus is unable to scrape Deckhouse metrics.
D8DeckhouseWatchErrorOccurred CE S5
Possible API server connection error in the client-go informer.
Deckhouse has detected an error in the client-go informer, possibly due to connection issues with the API server.

Steps to investigate:
1. Check Deckhouse logs for more information by running:
  
  d8 k -n d8-system logs deploy/deckhouse | grep error | grep -i watch
2. This alert attempts to detect a correlation between the faulty snapshot invalidation and API server connection errors, specifically for the handle-node-template hook in the node-manager module.
  
  To compare the snapshot with the actual node objects for this hook, run the following command:
  
  diff -u <(d8 k get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'|sort) <(d8 k -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller module snapshots node-manager -o json | jq '."040-node-manager/hooks/handle_node_templates.go"' | jq '.nodes.snapshot[] | .filterResult.Name' -r | sort)
D8HasModuleConfigAllowedToDisable CE S4
The ModuleConfig annotation for allowing module disabling is set.
The ModuleConfig is pending module disabling.

It is recommended that you keep your module configurations clean by removing unnecessary approval annotations.

If you ignore this alert and do not clear the annotation, the module may be accidentally removed from the cluster, potentially leading to irreversible consequences.

To resolve this issue and stop the alert, run the following command:
d8 k annotate moduleconfig MODULE_NAME modules.deckhouse.io/allow-disabling-
D8ModuleOutdatedByMajorVersion CE S9
Module MODULE_NAME is running XXX major versions behind.
Module MODULE_NAME is running XXX major versions behind on major version releases. Major version updates may contain breaking changes and require immediate attention and careful planning.

To check the module releases status, run the following command:
d8 k get mr -l module=MODULE_NAME ```XXXXXX This major release has from-to version constraints that must be met before applying the update. To check the update instructions (from-to versions), run the following command: ```bash d8 k get mr NAME -o jsonpath='{.spec.update.versions[*]}'
This will show you the update paths including the from-to versions defined in the release.XXX

To approve the module major release update, run the following command:

bash d8 k annotate mr NAME modules.deckhouse.io/approved="true"XXX
D8NodeHasDeprecatedOSVersion CE S4
Nodes with deprecated OS versions detected.
Deckhouse has detected nodes running deprecated OS versions.

Steps to troubleshoot:
1. Get a list of affected nodes by running the following Prometheus query:
  
  kube_node_info{os_image=~"Ubuntu 18.04.*|Debian GNU/Linux 10.*|CentOS Linux 7.*"}
2. Update the affected nodes to a supported OS version.
DeckhouseEditionNotFound CE S6
Incorrect Deckhouse Kubernetes Platform edition.
Specify a Deckhouse Kubernetes Platform edition in MODULE_NAME ModuleConfig.
d8 k patch moduleconfig MODULE_NAME --type merge --patch '{"spec": {"settings": {"license": {"edition": "DKP_EDITION"}}}}'
Where DKP_EDITION is the edition you want to use: CE, BE, EE, SE, or SE-plus.

DeckhouseHighMemoryUsage CE S3

Deckhouse memory usage is too high.

Deckhouse pod POD_NAME on node NODE_NAME has high memory usage.

Please, if possible, get the dumps for debugging purposes (it will take around 30 seconds):

curl -s "http://$(d8 k -n d8-system get pod POD_NAME -o jsonpath='{.status.hostIP}'):4222/debug/pprof/heap" -o /tmp/heap.pprof
curl -s "http://$(d8 k -n d8-system get pod POD_NAME -o jsonpath='{.status.hostIP}'):4222/debug/pprof/goroutine" -o /tmp/goroutine.pprof
curl -s "http://$(d8 k -n d8-system get pod POD_NAME -o jsonpath='{.status.hostIP}'):4222/debug/pprof/profile?seconds=30" -o /tmp/cpu.pprof

and send these files (/tmp/heap.pprof, /tmp/cpu.pprof and /tmp/goroutine.pprof) to the support team.

DeckhouseMigratedModuleNotFound CE S6
Migrated module MODULE_NAME not found in registry.
Module MODULE_NAME specified in DeckhouseRelease was moved to external source and not found in any available ModuleSource registry. DeckhouseRelease will be lock until found module in any available ModuleSource.

This may indicate:
- The module source registry is not properly configured.
- The module has not been published to the registry yet.
- There is a network issue preventing access to the registry.
Please check:
1. ModuleSource configurations: d8 k get modulesources.
2. Registry accessibility and credentials.
3. Module availability in the configured registries.
To investigate, run the following commands:
# Check ModuleSource status d8 k get modulesources # Check available modules in all sources d8 k -n d8-system exec -it deployment/deckhouse -c deckhouse -- deckhouse-controller registry get sources d8 k -n d8-system exec -it deployment/deckhouse -c deckhouse -- deckhouse-controller registry get modules source_name # Check Deckhouse logs for more details d8 k -n d8-system logs -f -l app=deckhouse
DeckhouseModuleUpdatingFailedBrokenSequence CE S4
Module update has failed. Updating sequence is broken.
The ‘MODULE_NAME’ Module update has failed.

Current version: XXX. Desired version: VERSION_NUMBER.

Attempt to download interim releases failed.

Possible reasons:
- The necessary versions of the Module image is not available in the registry.
To resolve this issue, ensure that all minor versions of the ‘MODULE_NAME’ Module image between XXX and VERSION_NUMBER is available in the registry ‘XXX’.
DeckhouseModuleUpdatingFailedModuleIsNotValid CE S4
Module update has failed. Module is not valid.
The ‘MODULE_NAME’ Module update has failed.

Desired version: VERSION_NUMBER.

Possible reasons:
- The Module image is corrupted.
To resolve this issue, ensure that VERSION_NUMBER image in the registry ‘XXX’ is valid .
DeckhouseReleaseDisruptionApprovalRequired CE S4
Deckhouse release disruption approval required.
The new Deckhouse release includes a disruptive update that requires manual approval.

To check the details, run the following command:
d8 k describe DeckhouseRelease NAME
To approve the disruptive update, run the following command:
d8 k annotate DeckhouseRelease NAME release.deckhouse.io/disruption-approved=true
DeckhouseReleaseIsBlocked CE S5
Deckhouse release requirements haven't been met.
The requirements for the Deckhouse release haven’t been met.

To check the details, run the following command:
d8 k describe DeckhouseRelease NAME
DeckhouseReleaseIsWaitingManualApproval CE S3
A new Deckhouse release is awaiting manual approval.
A new Deckhouse release is available but requires manual approval before it can be applied.

To approve the release, run the following command:
d8 k patch DeckhouseRelease NAME --type=merge -p='{"approved": true}'
DeckhouseReleaseIsWaitingManualApproval CE S6
A new Deckhouse release is awaiting manual approval.
A new Deckhouse release is available but requires manual approval before it can be applied.

To approve the release, run the following command:
d8 k patch DeckhouseRelease NAME --type=merge -p='{"approved": true}'
DeckhouseReleaseIsWaitingManualApproval CE S9
A new Deckhouse release is awaiting manual approval.
A new Deckhouse release is available but requires manual approval before it can be applied.

To approve the release, run the following command:
d8 k patch DeckhouseRelease NAME --type=merge -p='{"approved": true}'
DeckhouseReleaseNotificationNotSent CE S4
Deckhouse release notification webhook hasn't been sent.
The Deckhouse release notification webhook failed to send.

To check the notification webhook address, run the following command:
d8 k get mc deckhouse -o yaml
DeckhouseUpdating CE S4

Deckhouse is being updated to XXX.
DeckhouseUpdatingFailed CE S4
Deckhouse update has failed.
The Deckhouse update has failed.

Possible reasons:
- The next minor or patch version of the Deckhouse image is not available in the registry.
- The Deckhouse image is corrupted.
Current version: VERSION_NUMBER.

To resolve this issue, ensure that the next version of the Deckhouse image is available in the registry.
ModuleAtConflict CE S4

Conflict detected for module MODULE_NAME.

Deckhouse has detected conflicting sources for the MODULE_NAME module.

To resolve this issue, specify the correct source in the module configuration.
ModuleConfigObsoleteVersion CE S4

ModuleConfig NAME is outdated.

Deckhouse has detected that ModuleConfig NAME is outdated.

To resolve this issue, update ModuleConfig NAME to the latest version.
ModuleHasDeprecatedUpdatePolicy CE S4
Module MODULE_NAME is matched by the deprecated update policy UPDATE_POLICY_NAME.
Deckhouse has detected that the module MODULE_NAME is using the deprecated update policy UPDATE_POLICY_NAME. The v1alpha1 policy has a selector that no longer functions.

To specify the update policy in the module configuration, run the following command:
d8 k patch moduleconfig MODULE_NAME --type='json' -p='[{"op": "add", "path": "/spec/updatePolicy", "value": "UPDATE_POLICY_NAME"}]'
After resolving all alerts related to the update policy UPDATE_POLICY_NAME, clear the selector by running the following command:
d8 k patch moduleupdatepolicies.v1alpha1.deckhouse.io UPDATE_POLICY_NAME --type='json' -p='[{"op": "replace", "path": "/spec/moduleReleaseSelector/labelSelector/matchLabels", "value": {"": ""}}]'
ModuleIsDeprecated CE S9

Module MODULE_NAME is deprecated.

The module MODULE_NAME is deprecated and will not receive new updates. The module may be removed in the future, so please disable it in a controlled manner.
ModuleIsInMaintenanceMode CE S6
Module MODULE_NAME is running in maintenance mode.
Module MODULE_NAME is running in maintenance mode. In this mode, its state is not reconciled, which prevents configuration or hook changes from being applied automatically.

To switch the module back to normal mode, edit the corresponding ModuleConfig resource with the following command:
d8 k patch moduleconfig MODULE_NAME --type=json -p='[{"op": "remove", "path": "/spec/maintenance"}]'
ModuleReleaseIsBlockedByRequirements CE S6
A new release for module MODULE_NAME has been blocked due to unmet requirements.
A new release for module MODULE_NAME has been blocked because it doesn’t meet the required conditions.

To check the requirements, run the following command:
d8 k get mr NAME -o json | jq .spec.requirements
ModuleReleaseIsOutdated CE S5
Module MODULE_NAME is XXX or more minor versions behind the latest available release.
Module MODULE_NAME is running XXX or more minor versions behind the latest available release. This module is significantly outdated and may have compatibility issues.

To approve the module release update, run the following command:
d8 k annotate mr NAME modules.deckhouse.io/approved="true"
ModuleReleaseIsOutdated CE S7
Module MODULE_NAME is 2 minor versions behind the latest available release.
Module MODULE_NAME is running 2 minor versions behind the latest available release. Updating is recommended to maintain compatibility and access to latest features.

To approve the module release update, run the following command:
d8 k annotate mr NAME modules.deckhouse.io/approved="true"
ModuleReleaseIsOutdated CE S9
Module MODULE_NAME is 1 minor version behind the latest available release.
Module MODULE_NAME is running 1 minor version behind the latest available release. Consider updating to maintain compatibility and security.

To approve the module release update, run the following command:
d8 k annotate mr NAME modules.deckhouse.io/approved="true"
ModuleReleaseIsWaitingManualApproval CE S6
A new release for module MODULE_NAME is awaiting manual approval.
A new release for module MODULE_NAME is available but requires manual approval before it can be applied.

To approve the module release, run the following command:
d8 k annotate mr NAME modules.deckhouse.io/approved="true"

Модуль monitoring-kubernetes

CPUStealHigh CE S4

CPU steal on node NODE_NAME is too high.

The CPU steal has been excessively high on node NODE_NAME over the past 30 minutes.

Another component, such as a neighboring virtual machine, may be consuming the node’s resources. This may be the result of “overselling” the hypervisor, meaning the hypervisor is hosting more virtual machines than it can handle.
DeadMansSwitch CE S4

Alerting dead man's switch.

This is a dead man’s switch meant to ensure that the entire Alerting pipeline is functional.
DeploymentGenerationMismatch CE S4

Deployment is outdated.

The observed deployment generation doesn’t match the expected one for Deployment NAMESPACE/DEPLOYMENT_NAME.
EbpfExporterKernelNotSupported CE S8
The BTF module required for ebpf_exporter is missing from the kernel.
Possible options to resolve the issue:
- Build the kernel with BTF type information.
- Disable ebpf_exporter.
FdExhaustionClose CE S3

File descriptors for JOB_NAME: INSTANCE_NAME are almost exhausted.

The instance JOB_NAME: INSTANCE_NAME is expected to exhaust its available file/socket descriptors within the next hour.
FdExhaustionClose CE S3

File descriptors for JOB_NAME: NAMESPACE/POD_NAME are almost exhausted.

The instance JOB_NAME: NAMESPACE/POD_NAME is expected to exhaust its available file/socket descriptors within the next hour.
FdExhaustionClose CE S4

File descriptors for JOB_NAME: INSTANCE_NAME are exhausting soon.

The instance JOB_NAME: INSTANCE_NAME is expected to exhaust its available file/socket descriptors within the next 4 hours.
FdExhaustionClose CE S4

File descriptors for JOB_NAME: NAMESPACE/POD_NAME are exhausting soon.

The instance JOB_NAME: NAMESPACE/POD_NAME is expected to exhaust its available file/socket descriptors within the next 4 hours.
HelmReleasesHasResourcesWithDeprecatedVersions CE S5
At least one Helm release contains resources with deprecated apiVersion, which will be removed in Kubernetes version VERSION_NUMBER.
To list all affected resources, run the following Prometheus query:
max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 1
For more details on upgrading deprecated resources, refer to the Kubernetes deprecation guide available at https://kubernetes.io/docs/reference/using-api/deprecation-guide/#vXXX.

Note that the check runs once per hour, so this alert should resolve within an hour after migrating deprecated resources.
HelmReleasesHasResourcesWithUnsupportedVersions CE S4
At least one Helm release contains resources with unsupported apiVersion for Kubernetes version VERSION_NUMBER.
To list all affected resources, run the following Prometheus query:
max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 2
For more details on migrating deprecated resources, refer to the Kubernetes deprecation guide available at https://kubernetes.io/docs/reference/using-api/deprecation-guide/#vXXX.

Note that the check runs once per hour, so this alert should resolve within an hour after migrating deprecated resources.
K8SKubeletDown CE S3

Multiple kubelets couldn't be scraped.

Prometheus failed to scrape VALUE% of kubelets.
K8SKubeletDown CE S4

Several kubelets couldn't be scraped.

Prometheus failed to scrape VALUE% of kubelets.
K8SKubeletTooManyPods CE S7

The kubelet on node NODE_NAME is approaching the pod limit.

The kubelet on node NODE_NAME is running VALUE pods, which is close to the limit of XXX.
K8SManyNodesNotReady CE S3

Too many nodes are not ready.

VALUE% of Kubernetes nodes are not ready.
K8SNodeNotReady CE S3

The status of node NODE_NAME is NotReady.

The kubelet on node NODE_NAME has either failed to check in with the API server or has set itself to NotReady for more than 10 minutes.
KubeletImageFSBytesUsage CE S5

No free space remaining on imagefs on node NODE_NAME at mountpoint MOUNTPOINT.
KubeletImageFSBytesUsage CE S6

Hard eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Hard eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletImageFSBytesUsage CE S7

Approaching hard eviction threshold of imagefs on node NODE_NAME at mountpoint MOUNTPOINT.

The imagefs usage on node NODE_NAME at mountpoint MOUNTPOINT is nearing the hard eviction threshold.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletImageFSBytesUsage CE S9

Soft eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Soft eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Soft eviction threshold: XXX%

Current usage: VALUE%
KubeletImageFSInodesUsage CE S5

No free inodes remaining on imagefs on node NODE_NAME at mountpoint MOUNTPOINT.
KubeletImageFSInodesUsage CE S6

Hard eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Hard eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletImageFSInodesUsage CE S7

Approaching hard eviction threshold of imagefs on node NODE_NAME at mountpoint MOUNTPOINT.

The imagefs usage on node NODE_NAME at mountpoint MOUNTPOINT is nearing the hard eviction threshold.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletImageFSInodesUsage CE S9

Soft eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Soft eviction of imagefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Soft eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSBytesUsage CE S5

No free space remaining on nodefs on node NODE_NAME at mountpoint MOUNTPOINT.
KubeletNodeFSBytesUsage CE S6

Hard eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Hard eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSBytesUsage CE S7

Approaching hard eviction threshold of nodefs on node NODE_NAME at mountpoint MOUNTPOINT.

The nodefs usage on node NODE_NAME at mountpoint MOUNTPOINT is nearing the hard eviction threshold.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSBytesUsage CE S9

Soft eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Soft eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Soft eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSInodesUsage CE S5

No free inodes remaining on nodefs on node NODE_NAME at mountpoint MOUNTPOINT.
KubeletNodeFSInodesUsage CE S6

Hard eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Hard eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSInodesUsage CE S7

Approaching hard eviction threshold of nodefs on node NODE_NAME at mountpoint MOUNTPOINT.

The nodefs usage on node NODE_NAME at mountpoint MOUNTPOINT is nearing the hard eviction threshold.

Hard eviction threshold: XXX%

Current usage: VALUE%
KubeletNodeFSInodesUsage CE S9

Soft eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is in progress.

Soft eviction of nodefs on node NODE_NAME at mountpoint MOUNTPOINT is currently in progress.

Soft eviction threshold: XXX%

Current usage: VALUE%
KubernetesDnsTargetDown CE S5
Kube-dns or CoreDNS are not being monitored.
Prometheus is unable to collect metrics from kube-dns, which makes its status unknown.

Steps to troubleshoot:
1. Check the deployment details:
  
  d8 k -n kube-system describe deployment -l k8s-app=kube-dns
2. Check the pod details:
  
  d8 k -n kube-system describe pod -l k8s-app=kube-dns
KubeStateMetricsDown CE S3
Kube-state-metrics isn't working in the cluster.
Deckhouse has detected that no metrics about cluster resources have been available for the past 5 minutes. As a result, most alerts and monitoring panels aren’t working.

Steps to troubleshoot:
1. Check the kube-state-metrics pods:
  
  d8 k -n d8-monitoring describe pod -l app=kube-state-metrics
2. Check the deployment logs:
  
  d8 k -n d8-monitoring describe deploy kube-state-metrics
LoadBalancerServiceWithoutExternalIP CE S4
A LoadBalancer has not been created.
One or more services of the LoadBalancer type have not received an external address.

To list affected services, run the following command:
d8 k get svc -Ao json | jq -r '.items[] | select(.spec.type == "LoadBalancer") | select(.status.loadBalancer.ingress[0].ip == null) | "namespace: \(.metadata.namespace), name: \(.metadata.name), ip: \(.status.loadBalancer.ingress[0].ip)"'
Steps to troubleshoot:
- Check the cloud-controller-manager logs in the d8-cloud-provider-* namespace.
- If you are using a bare-metal cluster with the metallb module enabled, ensure the address pool has not been exhausted.
NodeConntrackTableFull CE S3

The conntrack table on node NODE_NAME is full.

As a result, no new connections created or accepted on the node. Keeping the conntrack table at full capacity may lead to erratic software behavior.

To identify the source of excessive conntrack entries, use Okmeter or Grafana dashboards.
NodeConntrackTableFull CE S4

The conntrack table on node NODE_NAME is approaching the size limit.

The conntrack table on node NODE_NAME is currently at VALUE% of its maximum capacity.

This is acceptable as long as the conntrack table remains 70-80% full. However, if it reaches full capacity, new connections may fail, causing network disruptions and erratic software behavior.

To identify the source of excessive conntrack entries, use Okmeter or Grafana dashboards.
NodeExporterDown CE S3

Prometheus couldn't scrape a node-exporter.

Prometheus has been unable to scrape a node-exporter for more than 10 minutes, or node-exporters have disappeared from service discovery.
NodeFilesystemIsRO CE S4

Node file system is read-only.

The file system on the node has been switched to read-only mode.

To investigate the cause, check the node logs.
NodeSUnreclaimBytesUsageHigh CE S4
Node NODE_NAME has high kernel memory usage.
The node NODE_NAME has a potential kernel memory leak. One known issue could be causing this problem.

Steps to troubleshoot:
1. Check the cgroupDriver setting on node NODE_NAME:
  
  cat /var/lib/kubelet/config.yaml | grep 'cgroupDriver: systemd'
2. If cgroupDriver is set to systemd, a reboot is required to switch back to the cgroupfs driver. In this case, drain and reboot the node.
For further details, refer to the issue.
NodeSystemExporterDoesNotExistsForNode CE S4
Some system exporters aren't functioning correctly on node NODE_NAME.
To resolve the issue, follow these steps:
1. Find the node-exporter Pod running on the affected node:
  
  d8 k -n d8-monitoring get pod -l app=node-exporter -o json | jq -r ".items[] | select(.spec.nodeName==\"NODE_NAME\") | .metadata.name"
2. Describe the node-exporter Pod:
  
  d8 k -n d8-monitoring describe pod POD_NAME
3. Verify that the kubelet is running on node NODE_NAME.
NodeTCPMemoryExhaust CE S6
Node NODE_NAME has high TCP stack memory usage.
The TCP stack on node NODE_NAME is experiencing high memory pressure. This may be caused by applications with intensive TCP networking usage.

Steps to troubleshoot:
1. Identify applications consuming excessive TCP memory.
2. Adjust TCP memory configuration if needed.
3. Investigate network traffic sources.
NodeUnschedulable CE S8
Node NODE_NAME is cordon-protected, preventing new pods from being scheduled.
The node NODE_NAME is in a cordon-protected state, meaning no new pods can be scheduled onto it.

Someone may have executed one of the following commands on this node:
- To cordon the node:
  
  d8 k cordon NODE_NAME
- To drain the node (if draining has been running for more than 20 minutes):
  
  d8 k drain NODE_NAME
The was likely caused by the scheduled maintenance of that node.

PodStatusIsIncorrect CE S6

Incorrect state of Pod NAMESPACE/POD_NAME running on node NODE_NAME.

The Pod NAMESPACE/POD_NAME running on node NODE_NAME is listed as NotReady while all the Pod’s containers are Ready.

This could have been caused by the known Kubernetes bug.

Steps to troubleshoot:

Identify all pods with this state:

d8 k get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | "\(.spec.nodeName)/\(.metadata.namespace)/\(.metadata.name)"'

Identify all affected nodes:

d8 k get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | .spec.nodeName' -r | sort | uniq -c

Restart kubelet on each node:
```
systemctl restart kubelet
```

StorageClassCloudManual CE S6

Manually deployed StorageClass NAME found in the cluster.

A StorageClass using a cloud-provider provisioner shouldn’t be deployed manually. Such StorageClasses are managed by the cloud-provider module.

Instead of the manual deployment, modify the cloud-provider module configuration as needed.
StorageClassDefaultDuplicate CE S6

Multiple default StorageClasses found in the cluster.

Deckhouse has detected that more than one StorageClass in the cluster is annotated as default.

This may have been caused by a manually deployed StorageClass that is overlapping with the default storage configuration provided by the cloud-provider module.
UnsupportedContainerRuntimeVersion CE
Unsupported version of CRI Containerd X.XX.XXX installed for Kubernetes version VERSION_NUMBER.
Deckhouse has detected that the node NODE_NAME is running an unsupported version of CRI Containerd X.XX.XXX.

Supported CRI versions for Kubernetes VERSION_NUMBER:
- Containerd 1.4.
- Containerd 1.5.
- Containerd 1.6.
- Containerd 1.7.

Модуль monitoring-kubernetes-control-plane

K8SApiserverDown CE S3

API servers can't be reached.

No API servers are reachable, or they have all disappeared from service discovery.
K8sCertificateExpiration CE S5
Kubernetes has API clients with soon-to-expire certificates.
Some clients are connecting to COMPONENT_NAME with certificates that will expire in less than a day on node COMPONENT_NAME.

To check control plane certificates, use kubeadm:
1. Install kubeadm using the following command:
  
  apt install kubeadm=1.24.*
2. Check certificates:
  
  kubeadm alpha certs check-expiration
To check kubelet certificates, do the following on each node:
1. Check kubelet configuration:
  
  ps aux \ | grep "/usr/bin/kubelet" \ | grep -o -e "--kubeconfig=\S*" \ | cut -f2 -d"=" \ | xargs cat
2. Locate the client-certificate or client-certificate-data field.
3. Check certificate expiration using OpenSSL.
Note that there are no tools to find other stale kubeconfig files. Consider enabling the control-plane-manager module for advanced debugging.
K8sCertificateExpiration CE S6
Kubernetes has API clients with soon-to-expire certificates.
Some clients are connecting to COMPONENT_NAME with certificates that will expire in less than 7 days on node NODE_NAME.

To check control plane certificates, use kubeadm:
1. Install kubeadm using the following command:
  
  apt install kubeadm=1.24.*
2. Check certificates:
  
  kubeadm alpha certs check-expiration
To check kubelet certificates, do the following on each node:
1. Check kubelet configuration:
  
  ps aux \ | grep "/usr/bin/kubelet" \ | grep -o -e "--kubeconfig=\S*" \ | cut -f2 -d"=" \ | xargs cat
2. Locate the client-certificate or client-certificate-data field.
3. Check certificate expiration using OpenSSL.
Note that there are no tools to find other stale kubeconfig files. Consider enabling the control-plane-manager module for advanced debugging.
K8SControllerManagerTargetDown CE S3

Controller manager is down.

There is no running kube-controller-manager. As a result, deployments and replication controllers are not progressing.
K8SSchedulerTargetDown CE S3

Scheduler is down.

The Kubernetes scheduler is not running. As a result, new pods are not being assigned to nodes.
KubeEtcdHighFsyncDurations CE S7
Syncing (fsync) WAL files to disk is slow.
In the last 15 minutes, the 99th percentile of the fsync duration for WAL files exceeded 0.5 seconds: VALUE.

Possible causes:
- High disk latency where etcd data is stored.
- High CPU usage on the node.
KubeEtcdHighNumberOfLeaderChanges CE S5
The etcd cluster is re-electing the leader too frequently.
There have been VALUE leader re-elections for the etcd cluster member running on node NODE_NAME in the last 10 minutes.

Possible causes:
- High disk latency where etcd data is stored.
- High CPU usage on the node.
- Degradation of network connectivity between cluster members in the multi-master mode.
KubeEtcdInsufficientMembers CE S4

Insufficient members in the etcd cluster.

The etcd cluster has too few members, increasing the risk of failure if another member becomes unavailable. To resolve this issue, check the status of etcd Pods: bash d8 k -n kube-system get pod -l component=etcd
KubeEtcdNoLeader CE S4

The etcd cluster member running on node NODE_NAME has lost the leader.

To resolve this issue, check the status of the etcd Pods: bash d8 k -n kube-system get pod -l component=etcd | grep NODE_NAME
KubeEtcdTargetAbsent CE S5
There is no etcd target in Prometheus.
Steps to troubleshoot:
1. Check the status of the etcd Pods:
  
  d8 k -n kube-system get pod -l component=etcd
2. Review Prometheus logs:
  
  d8 k -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus
KubeEtcdTargetDown CE S5
Prometheus is unable to scrape etcd metrics.
Steps to troubleshoot:
1. Check the status of the etcd Pods:
  
  d8 k -n kube-system get pod -l component=etcd
2. Review Prometheus logs:
  
  d8 k -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus

Модуль monitoring-ping

NodePingPacketLoss CE S4

Ping loss exceeds 5%.

ICMP packet loss to node NODE_NAME has exceeded 5%.

Модуль node-local-dns

D8NodeLocalDNSCacheDenialEvictionsHigh BE S6

node-local-dns denial cache is being frequently evicted.

The denial (NXDOMAIN) cache in node-local-dns is evicting entries frequently.

This might indicate that there are Services that are being recreated too often in cluster. So, denial cache size is not enough (denial 9984 is currently configured).

Affected node: NODE_NAME
D8NodeLocalDNSCacheSuccessEvictionsHigh BE S6

node-local-dns success cache is being frequently evicted.

The success cache in node-local-dns is evicting entries frequently (>100 in 5 minutes).

This could mean there are too many different DNS records application have to resolve (to many Services?). So, node-local-dns cache capacity is too small (success 39936 is currently configured), or traffic patterns changed.

Affected node: NODE_NAME
D8NodeLocalDNSForwardMaxConcurrentRejects BE S5

The forward plugin is rejecting DNS queries due to max concurrency limit.

More than 20 queries were rejected within the last 5 minutes on node NODE_NAME because the max concurrent query limit was reached.
D8NodeLocalDNSKubeforwardRequestLatencyP95High BE S4

High DNS request latency in kubeforward (p95 > 300ms).

The 95th percentile DNS request latency for A-records through kubeforward on node NODE_NAME exceeded 300 milliseconds over the last 5 minutes. This may indicate slow upstreams, network issues, or DNS overload on this node.
D8NodeLocalDNSUpstreamHealthcheckFailed BE S5

Too many upstream health check failures in forward plugin.

The health check system of the forward plugin failed more than 10 times within 5 minutes on node NODE_NAME. As a result, all upstreams were considered unhealthy.

Модуль node-manager

CapsInstanceUnavailable CE S8
There are unavailable instances in the MACHINE_DEPLOYMENT_NAME MachineDeployment.
The MachineDeployment MACHINE_DEPLOYMENT_NAME has VALUE unavailable instances.

Check the status of the instances in the cluster with the following command:
d8 k get instance -l node.deckhouse.io/group=MACHINE_DEPLOYMENT_NAME
ClusterHasOrphanedDisks CE S6

Cloud data discoverer found orphaned disks in the cloud.

The cloud data discoverer has found disks in the cloud that do not have a corresponding PersistentVolume in the cluster.

You can safely delete these disks manually from your cloud provider:

ID: ID, Name: NAME
D8BashibleApiserverLocked CE S6
Bashible-apiserver has been locked for too long.
Bashible-apiserver has been locked for an extended period.

To resolve the issue, check if the bashible-apiserver Pods are up-to-date and running:
d8 k -n d8-cloud-instance-manager get pods -l app=bashible-apiserver
D8CloudDataDiscovererCloudRequestError CE S6
Cloud data discoverer cannot get data from the cloud.
Refer to the cloud data discoverer’s logs for details:
d8 k -n NAMESPACE logs deploy/cloud-data-discoverer
D8CloudDataDiscovererSaveError CE S6
Cloud data discoverer cannot save data to a Kubernetes resource.
Refer to the cloud data discoverer’s logs for details:
d8 k -n NAMESPACE logs deploy/cloud-data-discoverer
D8ClusterAutoscalerManagerPodIsNotReady CE S8

The POD_NAME Pod is NOT Ready.
D8ClusterAutoscalerPodIsNotRunning CE S8
The cluster-autoscaler Pod is NOT Running.
The POD_NAME Pod is STATUS.

To check the Pod’s status, run the following command:
d8 k -n NAMESPACE get pods POD_NAME -o json | jq .status
D8ClusterAutoscalerPodIsRestartingTooOften CE S9
Too many cluster-autoscaler restarts detected.
The cluster-autoscaler has restarted VALUE times in the past hour.

Frequent restarts may indicate a problem. The cluster-autoscaler is expected to run continuously without interruption.

Check the logs for details:
d8 k -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler
D8ClusterAutoscalerTargetAbsent CE S8
Cluster-autoscaler target is missing in Prometheus.
The cluster-autoscaler automatically scales nodes in the cluster. If it’s unavailable, it will be impossible to add new nodes when there’s not enough resources for scheduling Pods. It may also lead to unnecessary cloud costs due to unused but still provisioned cloud instances.

To resolve the issue, follow these steps:
1. Check availability and status of cluster-autoscaler Pods:
  
  d8 k -n d8-cloud-instance-manager get pods -l app=cluster-autoscaler
2. Verify that the cluster-autoscaler Deployment exists:
  
  d8 k -n d8-cloud-instance-manager get deploy cluster-autoscaler
3. Check the Deployment’s status:
  
  d8 k -n d8-cloud-instance-manager describe deploy cluster-autoscaler
D8ClusterAutoscalerTargetDown CE S8

Prometheus is unable to scrape cluster-autoscaler’s metrics.
D8ClusterAutoscalerTooManyErrors CE S8
Cluster-autoscaler is issuing too many errors.
The cluster-autoscaler encountered multiple errors from the cloud provider when attempting to scale the cluster.

Check the logs for details:
d8 k -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler
D8MachineControllerManagerPodIsNotReady CE S8

The POD_NAME Pod is NOT Ready.
D8MachineControllerManagerPodIsNotRunning CE S8
The machine-controller-manager Pod is NOT Running.
The POD_NAME Pod is STATUS.

To check the Pod’s status, run the following command:
d8 k -n NAMESPACE get pods POD_NAME -o json | jq .status
D8MachineControllerManagerPodIsRestartingTooOften CE S9
Too many machine-controller-manager restarts detected.
The machine-controller-manager has restarted VALUE times in the past hour.

Frequent restarts may indicate a problem. The machine-controller-manager is expected to run continuously without interruption.

Check the logs for details:
d8 k -n d8-cloud-instance-manager logs -f -l app=machine-controller-manager -c controller
D8MachineControllerManagerTargetAbsent CE S8
Machine-controller-manager target is missing in Prometheus.
Machine-controller-manager controls ephemeral nodes in the cluster. If it becomes unavailable, it will be impossible to create or delete nodes in the cluster.

To resolve the issue, follow these steps:
1. Check availability and status of machine-controller-manager Pods:
  
  d8 k -n d8-cloud-instance-manager get pods -l app=machine-controller-manager
2. Verify availability of the machine-controller-manager Deployment:
  
  d8 k -n d8-cloud-instance-manager get deployment machine-controller-manager
3. Check the Deployment’s status:
  
  d8 k -n d8-cloud-instance-manager describe deployment machine-controller-manager
D8MachineControllerManagerTargetDown CE S8

Prometheus is unable to scrape the machine-controller-manager’s metrics.
D8NeedMigrateStateToOpenTofu CE S4
Terraform state must be migrated to OpenTofu.
This likely means the automatic migration to OpenTofu was canceled due to destructive changes detected in the cluster state.

Verify the following:
- Current Terraform version:
  
  d8 k -n d8-system exec -it deployments/terraform-state-exporter -c exporter -- terraform version
- Version in the Terraform state:
  
  d8 k -n d8-system get secret d8-cluster-terraform-state -o json | jq -r '.data["cluster-tf-state.json"] | @base64d | fromjson | .terraform_version'
- Check for destructive changes:
  
  d8 k exec -it deploy/terraform-state-exporter -n d8-system -- dhctl terraform check
To resolve the issue and migrate to OpenTofu manually, follow these steps:
1. Using the install container with the previous major Deckhouse release (for example, 1.69.X, while the cluster is now on 1.70.X), run:
  
  dhctl converge
  
  This should resolve all destructive changes in the cluster.
2. After that, the terraform-auto-converger Pod should complete the migration automatically.
If the automatic migration doesn’t happen, use the install container with the current Deckhouse release and run the following command to enforce migration of the Terraform state to OpenTofu:
dhctl converge-migration
If this alert appears in a cluster created by Deckhouse Commander, it means the cluster has been updated to a new Deckhouse version without approving state changes (for example, destructive changes).

To resolve this, follow these steps:
1. In Deckhouse Commander, click Delete manually to temporarily remove the cluster from control.
2. In the cluster, remove the commander-agent module using the following command:
  
  d8 k delete mc commander-agent
3. Using the install container with the previous major Deckhouse release (for example, 1.69.X, while the cluster is now on 1.70.X), run the following command to resolve all destructive changes in the cluster:
  
  dhctl converge
4. In Deckhouse Commander, click Attach to reattach the cluster.
D8NodeGroupIsNotUpdating CE S8

NodeGroup NODE_GROUP_NAME is not handling the update correctly.

There is a new update available for Nodes in the NODE_GROUP_NAME NodeGroup. Although Nodes have detected the update, none of them have received approval to start the update process.

Most likely, there is a problem with the update_approval hook of the node-manager module.
D8NodeIsNotUpdating CE S7
Node NODE_NAME cannot complete the update.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME detected a new update, requested and received approval, started the update, and encountered a step that could cause downtime. The update manager (the update_approval hook of the node-group module) granted downtime approval, but no success message was received, which indicates that the update has not completed.

To investigate the details, view Bashible logs on the Node:
journalctl -fu bashible
D8NodeIsNotUpdating CE S8
Node NODE_NAME cannot complete the update.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME has detected a new update, requested and received approval, but failed to complete the update.

To investigate the details, view Bashible logs on the Node:
journalctl -fu bashible
D8NodeIsNotUpdating CE S9
Node NODE_NAME is not updating.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME has a pending update but it’s neither receiving it nor attempting to.

Most likely, Bashible is not handling the update correctly. It should annotate the Node with update.node.deckhouse.io/waiting-for-approval before the update can proceed.

Options to investigate the details:
- Check the expected configuration checksum for the NodeGroup:
  
  d8 k -n d8-cloud-instance-manager get secret configuration-checksums -o jsonpath={.data.NODE_GROUP_NAME} | base64 -d
- Check the current configuration checksum on the Node:
  
  d8 k get node NODE_NAME -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
- View Bashible logs on the Node:
  
  journalctl -fu bashible
D8NodeIsUnmanaged CE S9

The NODE_NAME Node is not managed by the node-manager module.

The NODE_NAME Node is not managed by the node-manager module.

To resolve this issue, follow the instructions how to clean up the node and add it to the cluster.
D8NodeUpdateStuckWaitingForDisruptionApproval CE S8

Node NODE_NAME cannot obtain disruption approval.

Node NODE_NAME in NodeGroup NODE_GROUP_NAME detected a new update, requested and received initial approval, and started the update. However, it reached a stage that could cause downtime and was unable to obtain disruption approval. A disruption approval is normally issued automatically by the update_approval hook of the node-manager module.

To resolve this issue, investigate why the approval could not be granted to proceed with the update.
D8ProblematicNodeGroupConfiguration CE S8
Node NODE_NAME cannot begin the update.
There is a new update available for Nodes in the NODE_GROUP_NAME NodeGroup. However, Node NODE_NAME cannot begin the update.

The Node is missing the node.deckhouse.io/configuration-checksum annotation, which may indicate that its bootstrap process did not complete correctly.

Troubleshooting options:
- Check the cloud-init log (/var/log/cloud-init-output.log) on the node.
- Check the NodeGroupConfiguration resource associated with the NODE_GROUP_NAME NodeGroup for potential issues.
EarlyOOMPodIsNotReady CE S8
Pod POD_NAME has detected an unavailable PSI subsystem.
The Pod POD_NAME detected that the Pressure Stall Information (PSI) subsystem is unavailable.

For details, check the logs:
d8 k -n d8-cloud-instance-manager logs POD_NAME
Troubleshooting options:
- Upgrade the Linux kernel to version 4.20 or higher.
- Enable the Pressure Stall Information.
- Disable early OOM.
NodeGroupHasStaticInternalNetworkCIDRsField CE S9

NodeGroup NAME uses deprecated field spec.static.internalNetworkCIDRs.

The spec.static.internalNetworkCIDRs field is deprecated and has been moved to the static cluster configuration.

To resolve this issue, delete the field from NodeGroup NAME. This is safe, as the setting has already been migrated automatically.
NodeGroupMasterTaintIsAbsent CE S4
The master NodeGroup is missing the required control-plane taint.
The master NodeGroup doesn’t have the node-role.kubernetes.io/control-plane: NoSchedule taint. This may indicate a misconfiguration where control-plane nodes can run non-control-plane Pods.

To resolve the issue, add the following to the master NodeGroup spec:
nodeTemplate: taints: - effect: NoSchedule key: node-role.kubernetes.io/control-plane
Note that the taint key: node-role.kubernetes.io/master is deprecated and has no effect starting from Kubernetes 1.24.
NodeGroupReplicasUnavailable CE S7
NodeGroup NODE_GROUP_NAME has no available instances.
This probably means that machine-controller-manager is unable to create Machines using the cloud provider module.

Possible causes:
1. Cloud provider resource limits.
2. No access to the cloud provider API.
3. Misconfiguration of the cloud provider or instance class.
4. Problems with bootstrapping the Machine.
Recommended actions:
1. Check the status of the NodeGroup:
  
  d8 k get ng NODE_GROUP_NAME -o yaml
  
  Look for errors in the .status.lastMachineFailures field.
2. If no Machines stay in the Pending state for more than a couple of minutes, it likely means that Machines are being continuously created and deleted due to an error:
  
  d8 k -n d8-cloud-instance-manager get machine
3. If logs don’t show errors, and a Machine continues to be Pending, check its bootstrap status:
  
  d8 k -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus
4. If the output looks like the example below, connect via nc to examine bootstrap logs:
  
  { "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
5. If there’s no bootstrap log endpoint, cloudInit may not be working correctly. This could indicate a misconfigured instance class in the cloud provider.
NodeGroupReplicasUnavailable CE S8
Too many unavailable instances in the NODE_GROUP_NAME NodeGroup.
The number of simultaneously unavailable instances in the NODE_GROUP_NAME NodeGroup exceeds the allowed threshold. This may indicate that the autoscaler has provisioned too many nodes at once. Check the state of the Machine in the cluster. This probably means that machine-controller-manager is unable to create Machines using the cloud provider module.

Possible causes:
1. Cloud provider resource limits.
2. No access to the cloud provider API.
3. Misconfiguration of the cloud provider or instance class.
4. Problems with bootstrapping the Machine.
Recommended actions:
1. Check the status of the NodeGroup:
  
  d8 k get ng NODE_GROUP_NAME -o yaml
  
  Look for errors in the .status.lastMachineFailures field.
2. If no Machines stay in the Pending state for more than a couple of minutes, it likely means that Machines are being continuously created and deleted due to an error:
  
  d8 k -n d8-cloud-instance-manager get machine
3. If logs don’t show errors, and a Machine continues to be Pending, check its bootstrap status:
  
  d8 k -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus
4. If the output looks like the example below, connect via nc to examine bootstrap logs:
  
  { "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
5. If there’s no bootstrap log endpoint, cloudInit may not be working correctly. This could indicate a misconfigured instance class in the cloud provider.
NodeGroupReplicasUnavailable CE S8
NodeGroup NODE_GROUP_NAME has unavailable instances.
There are VALUE unavailable instances in the NodeGroup NODE_GROUP_NAME. This probably means that machine-controller-manager is unable to create Machines using the cloud provider module.

Possible causes:
1. Cloud provider resource limits.
2. No access to the cloud provider API.
3. Misconfiguration of the cloud provider or instance class.
4. Problems with bootstrapping the Machine.
Recommended actions:
1. Check the status of the NodeGroup:
  
  d8 k get ng NODE_GROUP_NAME -o yaml
  
  Look for errors in the .status.lastMachineFailures field.
2. If no Machines stay in the Pending state for more than a couple of minutes, it likely means that Machines are being continuously created and deleted due to an error:
  
  d8 k -n d8-cloud-instance-manager get machine
3. If logs don’t show errors, and a Machine continues to be Pending, check its bootstrap status:
  
  d8 k -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus
4. If the output looks like the example below, connect via nc to examine bootstrap logs:
  
  { "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
5. If there’s no bootstrap log endpoint, cloudInit may not be working correctly. This could indicate a misconfigured instance class in the cloud provider.
NodeRequiresDisruptionApprovalForUpdate CE S8
Node NODE_NAME requires disruption approval to proceed with the update.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME has detected a new update, received initial approval, and started the update. However, it encountered a stage that may cause downtime and requires manual disruption approval, because the NodeGroup is configured with disruptions.approvalMode: Manual.

To resolve the issue, ensure the Node is ready for unsafe updates (drained) and grant disruption approval by annotating the Node with update.node.deckhouse.io/disruption-approved=.

Caution:
- Nodes in manual mode aren’t drained automatically.
- Do not drain master nodes.
1. To drain the Node and grant the update approval, run the following command:
  
  d8 k drain NODE_NAME --delete-emptydir-data --ignore-daemonsets --force=true && d8 k annotate node NODE_NAME update.node.deckhouse.io/disruption-approved=
2. Uncordon the Node once the update is complete and the annotation update.node.deckhouse.io/approved is removed:
  
  while d8 k get node NODE_NAME -o json | jq -e '.metadata.annotations | has("update.node.deckhouse.io/approved")' > /dev/null; do sleep 1; done d8 k uncordon NODE_NAME
If the NodeGroup has multiple Nodes, repeat this process for each one, since only one Node is updated at a time. Consider temporarily switching to automatic disruption approval (disruptions.approvalMode: Automatic).
NodeStuckInDraining CE S6
Node NODE_NAME is stuck in draining.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME is stuck in the draining process.

To get more details, run the following:
d8 k -n default get event --field-selector involvedObject.name=NODE_NAME,reason=DrainFailed --sort-by='.metadata.creationTimestamp'
Error message: MESSAGE_CONTENTS
NodeStuckInDrainingForDisruptionDuringUpdate CE S6
Node NODE_NAME is stuck in draining.
Node NODE_NAME in NodeGroup NODE_GROUP_NAME has detected a new update, requested and received approval, started the update, and reached a step that could cause downtime. It is currently stuck in the draining process while waiting for automatic disruption approval.

To get more details, run the following:
d8 k -n default get event --field-selector involvedObject.name=NODE_NAME,reason=ScaleDown --sort-by='.metadata.creationTimestamp'
StandbyHolderDeploymentReplicasUnavailable CE S5

No available replicas in Deployment standby-holder-NODE_GROUP_NAME.

Deckhouse has detected that there are no available replicas in the standby-holder-NODE_GROUP_NAME Deployment in the d8-cloud-instance-manager namespace.

Check the Deployment and associated Pods for issues.
UnmetCloudConditions CE S6
Deckhouse update is unavailable due to unmet cloud provider conditions.
Deckhouse has detected that some cloud provider–specific conditions have not been met. Until these issues are resolved, updating to the new Deckhouse release is not possible.

Troubleshooting details:
- Name: NAME
- Message: MESSAGE_CONTENTS

Модуль okmeter

D8OkmeterAgentPodIsNotReady CE S6

Okmeter agent is not Ready.

Модуль openvpn

OpenVPNClientCertificateExpired CE S4
OpenVPN client certificate expired for XXX
The OpenVPN client certificate for XXX has expired.

Check the certificate information and make sure it actually expires:
d8 k -n d8-openvpn get secrets -l name="XXX" -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
Renew or delete the expired certificate. Check the documentation for details.
OpenVPNClientCertificateExpired CE S5
OpenVPN client certificate expires for XXX in less than 7 days.
The OpenVPN client certificate for XXX is set to expire in VALUE days.

Renewal of certificate required. Check the documentation for details.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets -l name=XXX -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
OpenVPNClientCertificateExpired CE S6
OpenVPN client certificate expires for XXX in less than 30 days.
The OpenVPN client certificate for XXX is set to expire in VALUE days.

Renewal of certificate required. Check the documentation for details.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets -l name=XXX -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
OpenVPNServerCACertificateExpired CE S4
OpenVPN CA certificate has expired.
OpenVPN CA certificate renews automatically 1 days before expiration. Automatic rotation did not work. Check the module status or perform manual rotation.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets openvpn-pki-ca -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
The hook should rotate CA and server certificates before expiry.
OpenVPNServerCACertificateExpiresTomorrow CE S4
OpenVPN CA certificate will expire tomorrow
OpenVPN CA certificate renews automatically 1 days before expiration. Automatic rotation did not work. Check the module status or perform manual rotation.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets openvpn-pki-ca -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
The hook should rotate CA and server certificates before expiry.
OpenVPNServerCACertificateExpiringInAWeek CE S5
OpenVPN CA certificate expires in less than 7 days.
The OpenVPN CA certificate is set to expire in VALUE days.

Renew the CA certificate if necessary. After renewing the certificate, you will need to reissue all client and server certificates. Check the documentation for details.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets -l name=server -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
OpenVPNServerCACertificateExpiringSoon CE S5
OpenVPN CA certificate expires in less than 30 days.
The OpenVPN CA certificate is set to expire in VALUE days.

Renew the CA certificate if necessary. After renewing the certificate, you will need to reissue all client and server certificates. Check the documentation for details.

View certificate details to display the exact expiration date:
d8 k -n d8-openvpn get secrets -l name=server -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
OpenVPNServerCertificateExpired CE S4
OpenVPN server certificate has expired.
The OpenVPN server certificate has expired.

OpenVPN server certificate renews automatically 1 days before expiration. Automatic rotation did not work. Check the module status or perform manual rotation.

Check the certificate information and make sure it actually expires.
d8 k -n d8-openvpn get secrets -l name=server -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout
OpenVPNServerCertificateExpiresTomorrow CE S4
OpenVPN server certificate will expire tomorrow.
The OpenVPN server certificate is set to expire in less than 1 day.

OpenVPN server certificate renews automatically 1 days before expiration. Automatic rotation did not work. Check the module status or perform manual rotation.

Check the certificate information and make sure it actually expires:
d8 k -n d8-openvpn get secrets -l name=server -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d | openssl x509 -text -noout

Модуль operator-prometheus

D8PrometheusOperatorPodIsNotReady CE S7
The prometheus-operator Pod is NOT Ready.
As a result, new Prometheus, PrometheusRules, and ServiceMonitor configurations cannot be applied in the cluster. However, all existing and configured components will continue to operate normally. This problem will not affect alerting or monitoring in the short term (for a few days).

Troubleshooting steps:
1. Analyze the Deployment details:
  
  d8 k -n d8-operator-prometheus describe deployment prometheus-operator
2. Examine the Pod’s to determine why it is not running:
  
  d8 k -n d8-operator-prometheus describe pod -l app=prometheus-operator
D8PrometheusOperatorPodIsNotRunning CE S7
The prometheus-operator Pod is NOT Running.
As a result, new Prometheus, PrometheusRules, and ServiceMonitor configurations cannot be applied in the cluster. However, all existing and configured components will continue to operate normally. This problem will not affect alerting or monitoring in the short term (for a few days).

Troubleshooting steps:
1. Analyze the Deployment details:
  
  d8 k -n d8-operator-prometheus describe deployment prometheus-operator
2. Examine the Pod’s to determine why it is not running:
  
  d8 k -n d8-operator-prometheus describe pod -l app=prometheus-operator
D8PrometheusOperatorTargetAbsent CE S7
Prometheus-operator target is missing in Prometheus.
As a result, new Prometheus, PrometheusRules, and ServiceMonitor configurations cannot be applied in the cluster. However, all existing and configured components will continue to operate normally. This problem will not affect alerting or monitoring in the short term (for a few days).

To resolve the issue, analyze the Deployment details:
d8 k -n d8-operator-prometheus describe deployment prometheus-operator
D8PrometheusOperatorTargetDown CE S8
Prometheus is unable to scrape prometheus-operator metrics.
The prometheus-operator Pod is unavailable.

As a result, new Prometheus, PrometheusRules, and ServiceMonitor configurations cannot be applied in the cluster. However, all existing and configured components will continue to operate normally. This problem will not affect alerting or monitoring in the short term (for a few days).

Troubleshooting steps:
1. Analyze the Deployment details:
  
  d8 k -n d8-operator-prometheus describe deployment prometheus-operator
2. Examine the Pod’s to determine why it is not running:
  
  d8 k -n d8-operator-prometheus describe pod -l app=prometheus-operator

Модуль prometheus

D8GrafanaDeploymentReplicasUnavailable CE S6
One or more Grafana Pods are NOT Running.
The number of Grafana replicas is lower than the specified number.

The Deployment is in the MinimumReplicasUnavailable state.

Troubleshooting options:
- To check the Deployment’s status:
  
  d8 k -n d8-monitoring get deployment grafana-v10 -o json | jq .status
- To check a Pod’s status:
  
  d8 k -n d8-monitoring get pods -l app=grafana-v10 -o json | jq '.items[] | {(.metadata.name):.status}'
D8GrafanaDeprecatedCustomDashboardDefinition CE S9

Deprecated ConfigMap for Grafana dashboards detected.

The ConfigMap grafana-dashboard-definitions-custom has been found in the d8-monitoring namespace. This indicates that a deprecated method for registering custom dashboards in Grafana is used.

This method is no longer supported.

Migrate to using the custom GrafanaDashboardDefinition resource instead.
D8GrafanaPodIsNotReady CE S6

The Grafana Pod is NOT Ready.
D8GrafanaPodIsRestartingTooOften CE S9
Excessive Grafana restarts detected.
Grafana has restarted VALUE times in the last hour.

Frequent restarts indicate a problem. Grafana is expected to run continuously without interruption.

To investigate the issue, check the logs:
d8 k -n d8-monitoring logs -f -l app=grafana-v10 -c grafana
D8GrafanaTargetAbsent CE S6
Grafana target is missing in Prometheus.
Grafana visualizes metrics collected by Prometheus. Grafana is critical for some tasks, such as monitoring the state of applications and the cluster as a whole. Additionally, Grafana’s unavailability can negatively impact users who actively use it in their work.

The recommended course of action:
1. Check the availability and status of Grafana Pods:
  
  d8 k -n d8-monitoring get pods -l app=grafana-v10
2. Check the availability of the Grafana Deployment:
  
  d8 k -n d8-monitoring get deployment grafana-v10
3. Examine the status of the Grafana Deployment:
  
  d8 k -n d8-monitoring describe deployment grafana-v10
D8GrafanaTargetDown CE S6

Prometheus is unable to scrape Grafana metrics.
D8PrometheusLongtermFederationTargetDown CE S5
Prometheus-longterm cannot scrape Prometheus.
The prometheus-longterm instance is unable to scrape the /federate endpoint from Prometheus.

Troubleshooting options:
- Check the prometheus-longterm logs.
- Open the corresponding web UI to check scrape errors.
D8PrometheusLongtermTargetAbsent CE S7
Prometheus-longterm target is missing in Prometheus.
The prometheus-longterm component is used to display historical monitoring data and is not crucial. However, its extended downtime may prevent access to statistics.

This issue is often caused by problems with disk availability. For example, if the disk cannot be mounted to a Node.

Troubleshooting steps:
1. Check the StatefulSet status:
  
  d8 k -n d8-monitoring describe statefulset prometheus-longterm
2. Inspect the PersistentVolumeClaim (if used):
  
  d8 k -n d8-monitoring describe pvc prometheus-longterm-db-prometheus-longterm-0
3. Inspect the Pod’s state:
  
  d8 k -n d8-monitoring describe pod prometheus-longterm-0
D8TricksterTargetAbsent CE S5
Trickster target is missing in Prometheus.
The following modules use the Trickster component:
- prometheus-metrics-adapter: Its unavailability means horizontal pod autoscaling (HPA) is not working, and you cannot view resource consumption using d8 k.
- vertical-pod-autoscaler: Short-term unavailability for this module is tolerable, as VPA looks at the consumption history for 8 days.
- grafana: All dashboards use Trickster by default to cache Prometheus queries. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and cause unavailability.
Troubleshooting steps:
1. Inspect the Deployment’s stats:
  
  d8 k -n d8-monitoring describe deployment trickster
2. Inspect the Pod’s stats:
  
  d8 k -n d8-monitoring describe pod -l app=trickster
3. Trickster often becomes unavailable due to Prometheus issues, since its readinessProbe depends on Prometheus being accessible.
  
  Make sure Prometheus is running:
  
  d8 k -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main
D8TricksterTargetAbsent CE S5
Trickster target is missing in Prometheus.
The following modules use the Trickster component:
- prometheus-metrics-adapter: Its unavailability means horizontal pod autoscaling (HPA) is not working, and you cannot view resource consumption using d8 k.
- vertical-pod-autoscaler: Short-term unavailability for this module is tolerable, as VPA looks at the consumption history for 8 days.
- grafana: All dashboards use Trickster by default to cache Prometheus queries. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and cause unavailability.
Troubleshooting steps:
1. Inspect the Deployment’s stats:
  
  d8 k -n d8-monitoring describe deployment trickster
2. Inspect the Pod’s stats:
  
  d8 k -n d8-monitoring describe pod -l app=trickster
3. Trickster often becomes unavailable due to Prometheus issues, since its readinessProbe depends on Prometheus being accessible.
  
  Make sure Prometheus is running:
  
  d8 k -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main
DeckhouseModuleUseEmptyDir CE S9

Deckhouse module MODULE_NAME is using emptyDir for storage.

The Deckhouse module MODULE_NAME is using emptyDir as its storage.

If the associated Pod is removed from the node for any reason, the data in the emptyDir is deleted permanently. Consider using persistent storage if data durability is critical for the module.
GrafanaDashboardAlertRulesDeprecated CE S8
Deprecated Grafana alerts detected.
Before upgrading to Grafana 10, migrate outdated alerts from Grafana to an external alertmanager (or exporter-alertmanager stack).

To list all deprecated alert rules, use the following expression:
sum by (dashboard, panel, alert_rule) (d8_grafana_dashboards_deprecated_alert_rule) > 0
Note that this check runs once per hour, so the alert may take up to an hour to clear after deprecated resources are migrated.
GrafanaDashboardPanelIntervalDeprecated CE S8
Deprecated Grafana panel intervals detected.
Before upgrading to Grafana 10, update outdated panel expressions that use $interval_rv, interval_sx3, or interval_sx4 to the new variable $__rate_interval.

To list all deprecated panel intervals, use the following expression:
sum by (dashboard, panel, interval) (d8_grafana_dashboards_deprecated_interval) > 0
Note that this check runs once per hour, so the alert may take up to an hour to clear after deprecated intervals are removed.
GrafanaDashboardPluginsDeprecated CE S8
Deprecated Grafana plugins detected.
Before upgrading to Grafana 10, make sure that the currently installed plugins will work correctly with Grafana 10.

To list all potentially outdated plugins, use the following expression:
sum by (dashboard, panel, plugin) (d8_grafana_dashboards_deprecated_plugin) > 0
The flant-statusmap-panel plugin is deprecated and no longer supported. It’s recommended that you migrate to the state timeline plugin instead.

Note that this check runs once per hour, so the alert may take up to an hour to clear after deprecated resources are migrated.
K8STooManyNodes CE S7

Node count is approaching the maximum allowed.

The cluster is currently running VALUE nodes, which is close to the maximum of XXX allowed nodes.
PrometheusDirectAccessDeprecated CE S6

Deprecated Ingress with direct Prometheus access found in cluster.

The cluster has an Ingress that allows direct access to Prometheus with client certificate authentication, e.g. for the external metrics access This approach is deprecated and the direct access to Prometheus will not be possible in the future Deckhouse Kubernetes Platform releases.

To resolve this issue, remove the deprecated Ingress and use the observability module to manage Prometheus metrics access.
PrometheusDiskUsage CE S4
Prometheus disk usage exceeds 95%.
Deckhouse has detected that the Prometheus disk is over 95% full.

To check the current usage, run the following command:
d8 k -n NAMESPACE exec -ti POD_NAME -c prometheus -- df -PBG /prometheus
Consider expanding disk size following the guide.
PrometheusLongtermRotatingEarlierThanConfiguredRetentionDays CE S4
Prometheus-longterm data is rotating earlier than the configured retention period.
The prometheus-longterm instance is rotating data earlier than specified by the longtermRetentionDays parameter.

Troubleshooting options:
- Increase the Prometheus disk size.
- Reduce the number of metrics collected.
- Lower the longtermRetentionDays parameter in the module configuration.
PrometheusMainRotatingEarlierThanConfiguredRetentionDays CE S4
Prometheus-main data is rotating earlier than the configured retention period.
The prometheus-main instance is rotating data earlier than specified by the retentionDays parameter.

Troubleshooting options:
- Increase the Prometheus disk size.
- Reduce the number of metrics collected.
- Lower the retentionDays parameter in the module configuration.
PrometheusScapeConfigDeclarationDeprecated CE S8

AdditionalScrapeConfigs from Secrets will be deprecated soon.

Defining additional scrape configurations using Secrets will be deprecated in prometheus-operator versions later than v0.65.1.

Migrate to using the ScrapeConfig custom resource instead.

PrometheusServiceMonitorDeprecated CE S8

Deprecated Prometheus ServiceMonitor detected.

The Kubernetes cluster uses the modern EndpointSlice network mechanism, but the ServiceMonitor NAMESPACE/NAME still uses relabeling rules based on the old Endpoints mechanism (starting with __meta_kubernetes_endpoints_).

Support for the old relabeling rules will be removed in a future Deckhouse release (1.60). To avoid issues, update your ServiceMonitor to use EndpointSlice-based relabeling.

Update relabeling rules as follows:

__meta_kubernetes_endpoints_name -> __meta_kubernetes_endpointslice_name
__meta_kubernetes_endpoints_label_XXX -> __meta_kubernetes_endpointslice_label_XXX
__meta_kubernetes_endpoints_labelpresent_XXX -> __meta_kubernetes_endpointslice_labelpresent_XXX
__meta_kubernetes_endpoints_annotation_XXX -> __meta_kubernetes_endpointslice_annotation_XXX
__meta_kubernetes_endpoints_annotationpresent_XXX -> __meta_kubernetes_endpointslice_annotationpresent_XXX
__meta_kubernetes_endpoint_node_name -> __meta_kubernetes_endpointslice_endpoint_topology_kubernetes_io_hostname
__meta_kubernetes_endpoint_ready -> __meta_kubernetes_endpointslice_endpoint_conditions_ready
__meta_kubernetes_endpoint_port_name -> __meta_kubernetes_endpointslice_port_name
__meta_kubernetes_endpoint_port_protocol -> __meta_kubernetes_endpointslice_port_protocol
__meta_kubernetes_endpoint_address_target_kind -> __meta_kubernetes_endpointslice_address_target_kind
__meta_kubernetes_endpoint_address_target_name -> __meta_kubernetes_endpointslice_address_target_name

TargetDown CE S5

A target is down.

Deckhouse has detected that the JOB_NAME target is is currently unreachable.
TargetDown CE S6

A target is down.

Deckhouse has detected that the JOB_NAME target is is currently unreachable.
TargetDown CE S7

A target is down.

Deckhouse has detected that the JOB_NAME target is is currently unreachable.
TargetSampleLimitExceeded CE S6

Scrapes are exceeding the sample limit.

One or more targets are down because the Prometheus sample limit was exceeded during a scrape.
TargetSampleLimitExceeded CE S7

Sampling limit is nearly reached.

The target is close to exceeding the Prometheus’s sampling limit. Less than 10% of the allowed samples are left.

Модуль secret-copier

D8SecretCopierDeprecatedLabels CE S9

Obsolete antiopa_secret_copier=yes label detected.

The secret-copier module has changed its labeling approach for original secrets in the default namespace.

The label antiopa-secret-copier: "yes" is deprecated and will be removed soon.

To resolve this issue, replace the label antiopa-secret-copier: "yes" with secret-copier.deckhouse.io/enabled: "" for all secrets used by the secret-copier module in the default namespace.

Модуль terraform-manager

D8TerraformStateExporterClusterStateChanged CE S8
Terraform-state-exporter cluster state change detected.
The current Kubernetes cluster state is STATUS_REFERENCE compared to the Terraform state.

It is important to reconcile the states.

Troubleshooting steps:
1. View the differences:
  
  dhctl terraform check
2. Apply the necessary changes to bring the cluster in sync:
  
  dhctl converge
D8TerraformStateExporterClusterStateError CE S8
Terraform-state-exporter cluster state error.
The terraform-state-exporter can’t check difference between the Kubernetes cluster state and the Terraform state.

That was likely caused by terraform-state-exporter failing to run Terraform with the current state and configuration.

Troubleshooting steps:
1. View the differences:
  
  dhctl terraform check
2. Apply the necessary changes to bring the cluster in sync:
  
  dhctl converge
D8TerraformStateExporterHasErrors CE S8
Terraform-state-exporter has encountered errors.
Errors occurred during the operation of the terraform-state-exporter.

To get more details, check the Pod logs:
d8 k -n d8-system logs -l app=terraform-state-exporter -c exporter
D8TerraformStateExporterNodeStateChanged CE S8
Terraform-state-exporter node state change detected.
The current state of node NODE_GROUP_NAME/NAME is STATUS_REFERENCE compared to the Terraform state.

It is important to reconcile the states.

Troubleshooting steps:
1. View the differences:
  
  dhctl terraform check
2. Apply the necessary changes to bring the cluster in sync:
  
  dhctl converge
D8TerraformStateExporterNodeStateError CE S8
Terraform-state-exporter node state error.
The terraform-state-exporter can’t check the difference between the node NODE_GROUP_NAME/NAME state and the Terraform state.

Probably, it occurred because terraform-manager had failed to run Terraform with the current state and configuration.

Troubleshooting steps:
1. View the differences:
  
  dhctl terraform check
2. Apply the necessary changes to bring the cluster in sync:
  
  dhctl converge
D8TerraformStateExporterNodeTemplateChanged CE S8
Terraform-state-exporter node template change detected.
The terraform-state-exporter has detected a mismatch between the node template in the cluster provider configuration and the one specified in the NodeGroup NAME`.

Node template is STATUS_REFERENCE.

Troubleshooting steps:
1. View the differences:
  
  dhctl terraform check
2. Adjust NodeGroup settings to fix the issue or bring the cluster in sync via the following command:
  
  dhctl converge
D8TerraformStateExporterPodIsNotReady CE S8
Terraform-state-exporter Pod is not Ready.
The terraform-state-exporter cannot check the difference between the actual Kubernetes cluster state and the Terraform state.

To resolve the issue, check the following:
1. Deployment description:
  
  d8 k -n d8-system describe deployment terraform-state-exporter
2. Pod status:
  
  d8 k -n d8-system describe pod -l app=terraform-state-exporter
D8TerraformStateExporterPodIsNotRunning CE S8
Terraform-state-exporter Pod is not Running.
The terraform-state-exporter cannot check the difference between the actual Kubernetes cluster state and the Terraform state.

To resolve the issue, check the following:
1. Deployment description:
  
  d8 k -n d8-system describe deployment terraform-state-exporter
2. Pod status:
  
  d8 k -n d8-system describe pod -l app=terraform-state-exporter
D8TerraformStateExporterTargetAbsent CE S8
Terraform-state-exporter target is missing in Prometheus.
Prometheus cannot find the terraform-state-exporter target.

To investigate the details:
- Check the Pod status:
  
  d8 k -n d8-system get pod -l app=terraform-state-exporter
- Check the container logs:
  
  d8 k -n d8-system logs -l app=terraform-state-exporter -c exporter
D8TerraformStateExporterTargetDown CE S8
Prometheus can't scrape terraform-state-exporter.
Prometheus is unable to scrape metrics from the terraform-state-exporter.

To investigate the details:
- Check the Pod status:
  
  d8 k -n d8-system get pod -l app=terraform-state-exporter
- Check the container logs:
  
  d8 k -n d8-system logs -l app=terraform-state-exporter -c exporter

Модуль upmeter

D8SmokeMiniNotBoundPersistentVolumeClaims CE S9
Smoke-mini has unbound or lost PersistentVolumeClaims.
The PersistentVolumeClaim PVC_NAME is in the STATUS phase.

This indicates a problem with PersistentVolume provisioning.

To investigate the cause, check the PersistentVolumeClaim phase:
d8 k -n d8-upmeter get pvc PVC_NAME
If your cluster doesn’t have a disk provisioning system, you can disable volume ordering for smoke-mini using the module settings.
D8UpmeterAgentPodIsNotReady CE S6

Upmeter agent is not Ready.
D8UpmeterAgentReplicasUnavailable CE S6
One or more upmeter-agent Pods are NOT Running.
Some upmeter-agent Pods are not in the Running state.

To investigate the details, do the following:
- Check the DaemonSet status:
  
  d8 k -n d8-upmeter get daemonset upmeter-agent -o json | jq .status
- Check the Pod status:
  
  d8 k -n d8-upmeter get pods -l app=upmeter-agent -o json | jq '.items[] | {(.metadata.name):.status}'
D8UpmeterDiskUsage CE S5
Upmeter disk usage exceeds 80%.
Disk usage for Upmeter has exceeded 80%.

The only way to resolve this is to recreate the PersistentVolumeClaim (PVC) in the following steps:
1. Save the PVC data if you need it.
2. Delete the PVC and restart the upmeter Pod:
  
  d8 k -n d8-upmeter delete persistentvolumeclaim/data-upmeter-0 pod/upmeter-0
3. Check the status of the created PVC:
  
  d8 k -n d8-upmeter get pvc
D8UpmeterProbeGarbageConfigmap CE S9
Garbage ConfigMaps from the basic probe are not being cleaned up.
Probe-generated ConfigMaps were found but not cleaned up as expected.

upmeter-agent should automatically delete ConfigMaps produced by the control-plane/basic probe. There should be no more ConfigMaps than there are master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and they should be deleted within seconds.

This may indicate:
- A problem with the kube-apiserver.
- Stale ConfigMaps left by outdated upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check the upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "basic-functionality") | [.time, .level, .msg] | @tsv'
2. Ensure the control plane is operating normally.
3. Delete stale ConfigMaps manually:
  
  d8 k -n d8-upmeter delete cm -l heritage=upmeter
D8UpmeterProbeGarbageDeployment CE S9
Garbage Deployments from the controller-manager probe are not being cleaned up.
The average number of probe-generated Deployments per upmeter-agent Pod is VALUE.

upmeter-agent should automatically delete Deployments created by the control-plane/controller-manager probe. There should not be more Deployments than master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and they should be deleted within seconds.

This may indicate:
- A problem with the kube-apiserver.
- Stale Deployments left by outdated upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check the upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'
2. Ensure the control plane is operating normally. Pay close attention to kube-controller-manager.
3. Delete stale Deployments manually:
  
  d8 k -n d8-upmeter delete deployment -l heritage=upmeter
D8UpmeterProbeGarbageNamespaces CE S9
Garbage namespaces from the namespace probe are not being cleaned up.
The average number of probe-created namespaces per upmeter-agent Pod is VALUE.

upmeter-agent should automatically clean up namespaces created by the control-plane/namespace probe.
There should not be more of these namespaces than there are master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and they should be deleted within seconds.

This may indicate:
- A problem with the kube-apiserver.
- Stale namespaces left from older upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "namespace") | [.time, .level, .msg] | @tsv'
2. Ensure the control plane is operating normally.
3. Delete stale namespaces manually:
  
  d8 k -n d8-upmeter delete ns -l heritage=upmeter
D8UpmeterProbeGarbagePods CE S9
Garbage Pods from the scheduler probe are not being cleaned up.
The average number of probe Pods per upmeter-agent Pod is VALUE.

upmeter-agent should automatically clean up Pods created by the control-plane/scheduler probe.
There should not be more of these Pods than there are master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and they should be deleted within seconds.

This may indicate:
- A problem with the kube-apiserver.
- Stale Pods left from old upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "scheduler") | [.time, .level, .msg] | @tsv'
2. Ensure the control plane is operating normally.
3. Delete stale Pods manually:
  
  d8 k -n d8-upmeter delete pods -l upmeter-probe=scheduler
D8UpmeterProbeGarbagePodsFromDeployments CE S9
Garbage Pods from the controller-manager probe are not being cleaned up.
The average number of probe Pods per upmeter-agent Pod is VALUE.

upmeter-agent is expected to clean Deployments created by the control-plane/controller-manager probe, and kube-controller-manager should remove the associated Pods. There should not be more of these Pods than there are master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and the Pods should be deleted within seconds.

This may indicate:
- A problem with the kube-apiserver or kube-controller-manager.
- Stale Pods left from outdated upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check the upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'
2. Ensure the control plane is operating normally. Pay close attention to kube-controller-manager.
3. Delete stale Pods manually:
  
  d8 k -n d8-upmeter delete pods -l upmeter-probe=controller-manager
D8UpmeterProbeGarbageSecretsByCertManager CE S9
Garbage Secrets from the cert-manager probe are not being cleaned up.
Probe-generated Secrets were found.

upmeter-agent should clean up certificates created by the cert-manager probe, and cert-manager in turn should clean up the associated Secrets. There should not be more of these Secrets than there are master nodes (since upmeter-agent runs as a DaemonSet with a master nodeSelector), and they should be deleted within seconds.

This may indicate:
- A problem with kube-apiserver, cert-manager, or upmeter.
- Stale Secrets left from older upmeter-agent Pods after an Upmeter update.
Troubleshooting steps:
1. Check upmeter-agent logs:
  
  d8 k -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "cert-manager") | [.time, .level, .msg] | @tsv'
2. Check that the control plane and cert-manager are operating normally.
3. Delete stale certificates and Secrets manually:
  
  d8 k -n d8-upmeter delete certificate -l upmeter-probe=cert-manager d8 k -n d8-upmeter get secret -ojson | jq -r '.items[] | .metadata.name' | grep upmeter-cm-probe | xargs -n 1 -- d8 k -n d8-upmeter delete secret
D8UpmeterServerPodIsNotReady CE S6

Upmeter server is not Ready.
D8UpmeterServerPodIsRestartingTooOften CE S9
Upmeter server is restarting too frequently.
The upmeter server has restarted VALUE times in the last hour.

Frequent restarts may indicate an issue. It is expected to run continuously and collect availability episodes without interruption.

To investigate the cause, check the logs:
d8 k -n d8-upmeter logs -f upmeter-0 upmeter
D8UpmeterServerReplicasUnavailable CE S6
One or more Upmeter server Pods are NOT Running.
Some upmeter Pods are not in the Running state.

To investigate the details, do the following:
- Check the StatefulSet status:
  
  d8 k -n d8-upmeter get statefulset upmeter -o json | jq .status
- Check the Pod status:
  
  d8 k -n d8-upmeter get pods upmeter-0 -o json | jq '.items[] | {(.metadata.name):.status}'
D8UpmeterSmokeMiniMoreThanOnePVxPVC CE S9
Unnecessary smoke-mini PersistentVolumes detected in the cluster.
The number of unnecessary smoke-mini PersistentVolumes (PVs) is VALUE.

These PVs should be deleted automatically when released.

Possible causes:
- The smoke-mini StorageClass may be set to Retain by default.
- There may be issues with the CSI driver or cloud storage integration.
These PVs do not contain valuable data and should be deleted.

To list all the PVs, run the following command:
d8 k get pv | grep disk-smoke-mini
D8UpmeterTooManyHookProbeObjects CE S9
Too many UpmeterHookProbe objects in the cluster.
The average number of UpmeterHookProbe objects per upmeter-agent Pod is VALUE, but it should always be exactly 1 per agent.

This likely happened because older upmeter-agent Pods left behind their UpmeterHookProbe objects during an Upmeter update or downscale.

Once the cause has been investigated, remove outdated objects and leave only the ones corresponding to currently running upmeter-agent Pods.

To view all UpmeterHookProbe objects, run the following command:
d8 k get upmeterhookprobes.deckhouse.io

Модуль user-authn

D8DexAllTargetsDown CE S6

Prometheus is unable to scrape Dex metrics.

Была ли страница полезна?

Спасибо за оценку!

Ваш отклик обрабатыватся, внести изменения можно будет через 5 минут

Произошла ошибка

Пожалуйста, попробуйте позже

Расскажите, что не понравилось

Мало полезного Есть ошибки Непонятно написано Устаревшая информация Другое

Список алертов

Критичность алерта

Модуль admission-policy-engine

Модуль cert-manager

Модуль chrony

Модуль cloud-provider-yandex

Модуль cni-cilium

Модуль control-plane-manager

Модуль documentation

Модуль extended-monitoring

Модуль ingress-nginx

Модуль istio

Модуль kube-dns

Модуль log-shipper

Модуль loki

Модуль metallb

Модуль monitoring-applications

Модуль monitoring-custom

Модуль monitoring-deckhouse

Модуль monitoring-kubernetes

Модуль monitoring-kubernetes-control-plane

Модуль monitoring-ping

Модуль node-local-dns

Модуль node-manager

Модуль okmeter

Модуль openvpn

Модуль operator-prometheus

Модуль prometheus

Модуль secret-copier

Модуль terraform-manager

Модуль upmeter

Модуль user-authn

Произошла ошибка

Расскажите, что не понравилось

Запросить пробный доступ

Запрос получен

Ошибка

Связаться со специалистом Deckhouse

Заявка отправлена

Возникла ошибка отправки формы

Запросить обучение

Запрос получен

Ошибка

Запросить демо

Запрос получен

Ошибка

Получите отчет о соответствии рекомендациям PCI SSC

Спасибо

Ошибка

Запросить подробности партнёрской программы

Запрос получен

Ошибка