Make collectors and parallel envs fail gracefully when one process fails #162
Description
Problem
Currently, if a single process fails in a parallel env or in a parallel data collector, the whole program will display the error but will keep running until the user interrupts it manually.
This is a problem because, in distributed settings, we will have to check whether a node is alive, and we don't want a node to be marked as alive when it is actually idle.
Desired behaviour
The program should stop by itself.
Proposed solution
Something along the lines of the following program should be implemented:
from torch import multiprocessing as mp
import torch


def fun():
    # worker that fails immediately
    raise RuntimeError


def notfun():
    # worker that runs forever without failing
    while True:
        a = 1 + 1


if __name__ == "__main__":
    mp.set_start_method("spawn")
    ps = []
    for i in range(10):
        if i % 2 == 0:
            p = mp.Process(target=fun, args=tuple())
        else:
            p = mp.Process(target=notfun, args=tuple())
        p.start()
        ps.append(p)

    # poll the workers: as soon as one of them is no longer alive,
    # terminate them all and raise in the main process
    terminate = False
    while True:
        for proc in ps:
            if not proc.is_alive():
                terminate = True
                for other in ps:
                    other.terminate()
                break
        if terminate:
            raise RuntimeError

    # never reached
    for p in ps:
        p.join()
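As a variant (a sketch only, not part of the proposal above; the watch helper name is hypothetical), the busy-wait loop could instead block on the workers' sentinels, which become ready as soon as a process terminates:

from multiprocessing.connection import wait


def watch(procs):
    # block until at least one worker terminates (cleanly or not),
    # then tear everything down and surface the failure
    wait([proc.sentinel for proc in procs])
    for proc in procs:
        proc.terminate()
    raise RuntimeError("a worker terminated")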
A listener in the collector / parallel env should look for processes that terminate unexpectedly (we should not stop if a process is not alive simply because it has finished its job).
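A minimal sketch of such a listener (the _check_workers helper is hypothetical, not TorchRL's actual API) could rely on exitcode: a worker that finished its job exits with code 0, whereas a crash leaves a nonzero exit code:

def _check_workers(procs):
    # hypothetical watchdog step called periodically by the collector / parallel env
    for proc in procs:
        if not proc.is_alive() and proc.exitcode not in (None, 0):
            # a worker died unexpectedly: terminate the others and raise
            for other in procs:
                if other.is_alive():
                    other.terminate()
            raise RuntimeError(
                f"worker {proc.pid} terminated unexpectedly (exit code {proc.exitcode})"
            )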
Tests should be written: they should check that (1) the program ends gracefully and (2) the error originated in a worker process (i.e. the program ended because of an error).
With the above example, a basic test could simply be a separate file executing this:
import subprocess

# TimeoutExpired is raised only if the program does not stop by itself within 10 seconds
subprocess.run(["python", "test.py"], timeout=10)
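To also cover point (2), the test could additionally assert that the script exits within the timeout and with a nonzero return code (here test.py stands for a script running the example above, and the test function name is a placeholder):

import subprocess
import sys


def test_program_stops_on_worker_failure():
    # raises subprocess.TimeoutExpired if the program never stops by itself
    completed = subprocess.run([sys.executable, "test.py"], timeout=10)
    # the main process re-raises the failure, hence a nonzero return code
    assert completed.returncode != 0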