There are usually S3 access errors in the logs, which means that the checkpoint snapshot is corrupted. We have fixed the root cause, but a “factory reset” is still required:
...
Sometimes flink-jobmanager crashes and does not come back up, which means one of the checkpoint snapshots is corrupted. To fix it, use one of the following methods:
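Before doing the reset, it can help to confirm the symptom in the jobmanager logs. A minimal check, assuming the jobmanager runs as the flink-jobmanager deployment in the default namespace (as in the commands below):

```
# Look for S3 / checkpoint errors in the crashed jobmanager's logs
kubectl logs deployment/flink-jobmanager -n default --previous | grep -iE "s3|checkpoint"
```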
In Rancher
1. Scale down the flink-jobmanager deployment to 0.
2. Scale down the flink-taskmanager deployment to 0.
3. Go to storage/configmaps.
4. Delete gv-flink-cluster-config-map.
5. Delete gv-flink-*-config-map.
6. Scale up the flink-taskmanager deployment to 1; the pod will be recreated. Wait 15-20 sec.
7. Scale up the flink-jobmanager deployment to 1; it will start successfully.
In terminal
...
```
kubectl scale --replicas=0 deployment/flink-jobmanager
kubectl scale --replicas=0 deployment/flink-taskmanager
kubectl get configmap -n default | grep "gv-flink"
kubectl delete configmap gv-flink-cluster-config-map -n default
kubectl delete configmap gv-flink-*-config-map -n default   # <--- insert config map names from the previous command
kubectl scale --replicas=1 deployment/flink-taskmanager
sleep 20
kubectl scale --replicas=1 deployment/flink-jobmanager
```
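After scaling back up, it is worth verifying that both deployments become available again and that the jobmanager no longer crash-loops on the corrupted checkpoint. A quick check, assuming the same deployment names and the default namespace as above:

```
# Wait for the deployments to report a successful rollout
kubectl rollout status deployment/flink-taskmanager
kubectl rollout status deployment/flink-jobmanager

# Tail the jobmanager logs to confirm it starts cleanly
kubectl logs -f deployment/flink-jobmanager
```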