Fix pods stuck terminating after node failure
Node failures can leave pods stuck in the Terminating state without being evicted. Clearing them requires either removing the node (through the API) or force-deleting the pods. Descheduler can solve this issue.
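For reference, the two manual workarounds mentioned above can be sketched with kubectl as follows. This is a minimal sketch, not a prescribed procedure: the function names and the KUBECTL override variable are hypothetical helpers introduced for illustration, and the node, namespace, and pod names are supplied by the caller.

```shell
# Hypothetical helper functions; only the kubectl subcommands and flags are real.
# KUBECTL can be overridden (for example KUBECTL=echo) to print the commands
# instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# List every pod still bound to the given node.
list_pods_on_node() {
  "$KUBECTL" get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# Workaround 1: force-delete one stuck pod (arguments: namespace, pod name).
# Force deletion removes the pod from the API without waiting for the kubelet
# to confirm that its containers have stopped.
force_delete_pod() {
  "$KUBECTL" delete pod "$2" --namespace "$1" --grace-period=0 --force
}

# Workaround 2: remove the failed Node object through the API.
remove_node() {
  "$KUBECTL" delete node "$1"
}
```

A typical sequence would be to call list_pods_on_node with the failed node's name, then either force_delete_pod for each stuck pod or remove_node once it is certain the node will not come back.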
Example Descheduler configuration values:
replicas: 2
kind: Deployment
deschedulerPolicyAPIVersion: descheduler/v1alpha2
deschedulerPolicy:
  profiles:
    - name: Default
      pluginConfig:
        - name: DefaultEvictor
          args:
            evictFailedBarePods: true
            evictLocalStoragePods: true
            evictSystemCriticalPods: true
        - name: RemoveFailedPods
          args:
            reasons:
              - ContainerStatusUnknown
              - NodeAffinity
              - NodeShutdown
              - Terminated
              - UnexpectedAdmissionError
            includingInitContainers: true
            excludeOwnerKinds:
              - Job
            minPodLifetimeSeconds: 1800
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
              - requiredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingTopologySpreadConstraint
      plugins:
        balance:
          enabled:
            - RemovePodsViolatingTopologySpreadConstraint
        deschedule:
          enabled:
            - RemoveFailedPods
            - RemovePodsViolatingInterPodAntiAffinity
            - RemovePodsViolatingNodeAffinity
            - RemovePodsViolatingNodeTaints
service:
  enabled: true
serviceMonitor:
  enabled: true
leaderElection:
  enabled: true
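The keys above look like values for the upstream Descheduler Helm chart, although the source does not say how they are applied. Assuming that is the case, and assuming the values are saved to a file, an installation could be sketched like this. The install_descheduler function, the HELM override variable, the release name "descheduler", and the kube-system namespace are illustrative choices, not taken from the original.

```shell
# Hypothetical wrapper; only the helm subcommands and flags are real.
# HELM can be overridden (for example HELM=echo) to print the commands
# instead of running them.
HELM="${HELM:-helm}"

# Install or upgrade Descheduler using the given values file.
# Release name and namespace are assumptions made for this sketch.
install_descheduler() {
  values_file="$1"
  "$HELM" repo add descheduler https://kubernetes-sigs.github.io/descheduler/
  "$HELM" repo update
  "$HELM" upgrade --install descheduler descheduler/descheduler --namespace kube-system --values "$values_file"
}
```

Calling install_descheduler values.yaml would then apply the configuration. Since serviceMonitor is enabled in the values, this presumably also expects the Prometheus Operator CRDs to be present in the cluster.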