Usually the logs contain S3 access errors, which means the checkpoint snapshot is corrupted. We have fixed the root cause, but a “factory reset” is required:
...
Sometimes flink-jobmanager crashes and does not come back up, which means one of the checkpoint snapshots is corrupted. To fix it, use one of the following methods:
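Before resetting, it can help to confirm the symptom in the logs. A minimal check, assuming the flink-jobmanager deployment name and default namespace used in the commands below:

```
# Look for S3 access errors in the jobmanager logs
kubectl logs deployment/flink-jobmanager -n default --tail=200 | grep -i s3

# If the pod is crash-looping, the previous container's logs may be more useful
kubectl logs deployment/flink-jobmanager -n default --previous | grep -i s3
```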
In Rancher
1. Scale down the flink-jobmanager deployment to 0.
2. Scale down the flink-taskmanager deployment to 0.
3. Go to storage/configmaps.
4. Delete gv-flink-cluster-config-map.
5. Delete gv-flink-*-config-map.
6. Wait 15-20 sec.
7. Scale up the flink-taskmanager deployment to 1. Delete the flink-taskmanager pod; it will recreate.
8. Scale up the flink-jobmanager deployment to 1. The flink-jobmanager pod will start successfully.
In terminal
...
```
kubectl scale --replicas=0 deployment/flink-jobmanager
kubectl scale --replicas=0 deployment/flink-taskmanager
kubectl get configmap -n default
kubectl delete configmap gv-flink-cluster-config-map -n default
kubectl delete configmap gv-flink-*-config-map -n default   # <--- replace gv-flink-*-config-map with the actual name from the previous command
kubectl scale --replicas=1 deployment/flink-jobmanager
kubectl scale --replicas=1 deployment/flink-taskmanager
```
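After scaling back up, verify that the cluster actually recovers. A minimal check, assuming the same deployment names and namespace:

```
# Wait until each deployment reports its pods as available
kubectl rollout status deployment/flink-jobmanager -n default
kubectl rollout status deployment/flink-taskmanager -n default

# Confirm flink-jobmanager is Running and no longer restarting
kubectl get pods -n default | grep flink
```

If flink-jobmanager keeps restarting, re-check its logs for S3 access errors as shown above.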