The logs usually contain S3 access errors, which means the checkpoint snapshot is corrupted. We have fixed the root cause, but a “factory reset” of the Flink cluster is still required.
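To confirm the symptom first, you can grep the JobManager logs for S3 errors. This is a minimal check; it assumes the flink-jobmanager deployment name used in the steps below and the default namespace:
kubectl logs deployment/flink-jobmanager --tail=200 | grep -i s3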
Reset steps:
1. Delete the ConfigMap gv-flink-cluster-config-map. It is a good idea to delete gv-flink-*-config-map as well.
2. Delete the flink-taskmanager pod; it will be recreated automatically.
3. Wait 15-20 seconds.
4. Delete the flink-jobmanager pod; it will start successfully.
In a terminal:
kubectl scale --replicas=0 deployment/flink-jobmanager
kubectl scale --replicas=0 deployment/flink-taskmanager
kubectl get configmap -n default
kubectl delete configmap gv-flink-cluster-config-map -n default
kubectl scale --replicas=1 deployment/flink-jobmanager
kubectl scale --replicas=1 deployment/flink-taskmanager
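After scaling back up, it is worth verifying that both pods are running again and that the JobManager logs no longer show S3 errors (again assuming the deployment names above and the default namespace):
kubectl get pods -n default
kubectl logs deployment/flink-jobmanager --tail=100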