I am trying to understand how a distributed job behaves when it fails on one node but not on the others. Is there a way to restart the failed node and continue executing tasks for the distributed job?
Kishan Bhoopalam
Asked on Feb 07, 2024
In a distributed job, if the job's process fails on one node but not on the others, the other nodes keep running: a failure on one node does not terminate the job's processes on the other nodes. To restart the failed node and continue executing tasks for the distributed job, the application framework needs a way to determine that a process within the 'world' has exited. Most popular ML frameworks have this capability. If you are running a framework that doesn't, you can implement your own mechanism to detect and restart failed nodes.
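For a framework without built-in failure detection, a detect-and-restart mechanism could look roughly like the sketch below. It is only an illustration under some assumptions: each rank is a local child process started from a command line, a nonzero exit code means failure, and the names `supervise`, `max_restarts`, and `poll_interval` are hypothetical, not part of any framework. It covers detection and relaunch only; how a restarted process rejoins the 'world' depends on the framework and is not shown.

```python
import subprocess
import sys
import time


def supervise(commands, max_restarts=3, poll_interval=0.1):
    """Run one process per rank and restart any rank that exits with an error.

    commands: list of argv lists, one per rank (hypothetical layout).
    Returns a dict mapping rank -> number of restarts performed.
    Raises RuntimeError if a rank fails more than max_restarts times.
    """
    procs = {rank: subprocess.Popen(cmd) for rank, cmd in enumerate(commands)}
    restarts = {rank: 0 for rank in procs}
    running = set(procs)

    while running:
        for rank in list(running):
            code = procs[rank].poll()  # None while the process is still alive
            if code is None:
                continue
            if code == 0:
                running.discard(rank)  # this rank finished cleanly
            elif restarts[rank] < max_restarts:
                # Failure detected: relaunch only this rank; the others keep running.
                restarts[rank] += 1
                procs[rank] = subprocess.Popen(commands[rank])
            else:
                # Out of retries: stop the surviving ranks and give up.
                survivors = running - {rank}
                for other in survivors:
                    procs[other].terminate()
                for other in survivors:
                    procs[other].wait()
                raise RuntimeError(
                    f"rank {rank} failed {restarts[rank] + 1} times (last exit code {code})"
                )
        time.sleep(poll_interval)
    return restarts


if __name__ == "__main__":
    # Two trivial stand-in workers; replace with your real launch commands.
    worker = [sys.executable, "-c", "print('worker done')"]
    print(supervise([worker, worker]))
```

Polling exit codes is the simplest way to notice that a process has exited. Across real nodes you would likely need a heartbeat or your scheduler's own status reporting instead, since a supervisor on one machine cannot poll processes on another.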