Usually there are S3 access errors in the logs, which means the checkpoint snapshot is corrupted. We have fixed the root cause, but a “factory reset” is required:

...

Sometimes you may see flink-jobmanager crashing and not coming back up, which means one of the checkpoint snapshots is corrupted. To fix it, follow these steps:

  1. Scale down flink-jobmanager deployment to 0.

  2. Scale down flink-taskmanager deployment to 0.

  3. Go to storage/configmaps.

  4. Delete gv-flink-cluster-config-map (it’s a good idea to delete gv-flink-*-config-map as well).
  5. Delete the flink-taskmanager pod; it will be recreated.

  6. Wait 15-20 sec.

  7. Scale up flink-taskmanager deployment to 1.

  8. Scale up flink-jobmanager deployment to 1; it will start successfully.
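The whole procedure can be scripted. Below is a minimal dry-run sketch, assuming the resources live in a namespace called `flink` and that the taskmanager pod carries an `app=flink-taskmanager` label — both are assumptions, so adjust them to your cluster. Each kubectl command is printed rather than executed; drop the `echo` to run it for real.

```shell
#!/bin/sh
# Dry-run sketch of the "factory reset" steps above.
# The namespace "flink" and the pod label are assumptions; the
# deployment and configmap names come from the steps in this page.
reset_flink() {
  ns="${1:-flink}"            # assumed namespace
  run="echo kubectl -n $ns"   # drop the "echo" to execute for real

  $run scale deployment flink-jobmanager --replicas=0    # step 1
  $run scale deployment flink-taskmanager --replicas=0   # step 2
  $run delete configmap gv-flink-cluster-config-map      # step 4
  $run delete pod -l app=flink-taskmanager               # step 5 (label assumed)
  # step 6: wait 15-20 sec here before scaling back up
  $run scale deployment flink-taskmanager --replicas=1   # step 7
  $run scale deployment flink-jobmanager --replicas=1    # step 8
}

reset_flink flink
```

Deleting the gv-flink-*-config-map variants (step 4) is left out of the sketch because `kubectl delete` does not expand globs; list them with `kubectl get configmap` and delete by exact name.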

...