Make collectors and parallel envs fail gracefully when one process fails #162
Description
Problem
Currently, if a single process fails in a parallel env or in a parallel data collector, the whole program will display the error but will keep running until the user interrupts it manually.
This is a problem because, in distributed settings, we will have to check whether a node is alive, and we don't want a node to be marked as alive when it is actually idle.
Desired behaviour
The program should stop by itself.
Proposed solution
Something along the lines of the following program should be implemented:
from torch import multiprocessing as mp
import torch


def fun():
    # worker that fails immediately
    raise RuntimeError


def notfun():
    # worker that runs forever without failing
    while True:
        a = 1 + 1


if __name__ == "__main__":
    mp.set_start_method("spawn")
    ps = []
    for i in range(10):
        if i % 2 == 0:
            p = mp.Process(target=fun, args=tuple())
        else:
            p = mp.Process(target=notfun, args=tuple())
        p.start()
        ps.append(p)

    # poll the workers: as soon as one of them is no longer alive,
    # terminate them all and raise in the main process
    terminate = False
    while True:
        for proc in ps:
            if not proc.is_alive():
                terminate = True
                for other in ps:
                    other.terminate()
                break
        if terminate:
            raise RuntimeError

    # never reached
    for p in ps:
        p.join()
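As a variant (a sketch only, not part of the proposal above; the watch helper name is hypothetical), the busy-wait loop could instead block on the workers' sentinels, which become ready as soon as a process terminates:

from multiprocessing.connection import wait


def watch(procs):
    # block until at least one worker terminates (cleanly or not),
    # then tear everything down and surface the failure
    wait([proc.sentinel for proc in procs])
    for proc in procs:
        proc.terminate()
    raise RuntimeError("a worker terminated")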
A listener in the collector / parallel env should look for processes that terminate unexpectedly (we should not stop if a process is not alive simply because it has finished its job).
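A minimal sketch of such a listener (the _check_workers helper is hypothetical, not TorchRL's actual API) could rely on exitcode: a worker that finished its job exits with code 0, whereas a crash leaves a nonzero exit code:

def _check_workers(procs):
    # hypothetical watchdog step called periodically by the collector / parallel env
    for proc in procs:
        if not proc.is_alive() and proc.exitcode not in (None, 0):
            # a worker died unexpectedly: terminate the others and raise
            for other in procs:
                if other.is_alive():
                    other.terminate()
            raise RuntimeError(
                f"worker {proc.pid} terminated unexpectedly (exit code {proc.exitcode})"
            )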
Tests should be written: they should check that (1) the program ends gracefully and (2) the error originated in a worker process (i.e. the program ended because of an error).
With the above example, a basic test could simply be a separate file executing this:
import subprocess

# TimeoutExpired is raised only if the program does not stop by itself within 10 seconds
subprocess.run(["python", "test.py"], timeout=10)
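To also cover point (2), the test could additionally assert that the script exits within the timeout and with a nonzero return code (here test.py stands for a script running the example above, and the test function name is a placeholder):

import subprocess
import sys


def test_program_stops_on_worker_failure():
    # raises subprocess.TimeoutExpired if the program never stops by itself
    completed = subprocess.run([sys.executable, "test.py"], timeout=10)
    # the main process re-raises the failure, hence a nonzero return code
    assert completed.returncode != 0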