Make collectors and parallel envs fail gracefully when one process fails #162

Closed
@vmoens

Description

Problem

Currently, if a single process fails in a parallel env or in a parallel data collector, the error is displayed but the program keeps running until the user interrupts it.
This is a problem in distributed settings: we will have to check whether a node is alive, and we don't want a node to be marked as alive when it is actually idle.

Desired behaviour

The program should stop by itself as soon as one of its processes fails.

Proposed solution

Something along the lines of the following program should be implemented:

import time

from torch import multiprocessing as mp


def fun():
    # Worker that fails immediately.
    raise RuntimeError


def notfun():
    # Worker that never returns.
    while True:
        a = 1 + 1


if __name__ == "__main__":
    mp.set_start_method("spawn")
    ps = []
    for i in range(10):
        if i % 2 == 0:
            p = mp.Process(target=fun, args=tuple())
        else:
            p = mp.Process(target=notfun, args=tuple())
        p.start()
        ps.append(p)

    while True:
        # As soon as one process is dead, stop all the others and exit.
        if any(not p.is_alive() for p in ps):
            for p in ps:
                p.terminate()
            for p in ps:
                p.join()
            raise RuntimeError("a worker process died")
        time.sleep(0.1)

A listener in the collector / parallel env should look for processes that have terminated unexpectedly (we should not stop when a process is no longer alive simply because it has finished its job).
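One way to tell an unexpected termination from a normal one is the `exitcode` attribute of a `multiprocessing.Process`: it is `None` while the process runs, `0` after a clean exit, and non-zero after an uncaught exception or a signal. Below is a minimal sketch of such a listener, not the actual implementation. The names `WorkerFailedError`, `failed_workers` and `supervise` are hypothetical, and the code only assumes objects that expose `exitcode`, `is_alive()`, `terminate()` and `join()`.

```python
import time


class WorkerFailedError(RuntimeError):
    """Raised by the supervisor when a worker exits with a non-zero code."""


def failed_workers(procs):
    """Return the processes that have exited with a non-zero exit code.

    `exitcode` is None while a process is running, 0 after a clean exit,
    and non-zero after an uncaught exception or a signal.
    """
    return [p for p in procs if p.exitcode not in (None, 0)]


def supervise(procs, poll_interval=0.1):
    """Block until every worker has finished or one of them has failed."""
    while True:
        failed = failed_workers(procs)
        if failed:
            # Stop the survivors so that the whole program can exit.
            for p in procs:
                if p.is_alive():
                    p.terminate()
            for p in procs:
                p.join()
            codes = [p.exitcode for p in failed]
            raise WorkerFailedError(f"worker(s) failed with exit codes {codes}")
        if all(p.exitcode == 0 for p in procs):
            return  # every worker finished its job normally
        time.sleep(poll_interval)
```

In the main process, `supervise(ps)` would be called after all the workers have been started. A worker that returns normally does not trigger the shutdown, whereas a worker that raises does.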

Tests should be written: they should test that (1) the program ends gracefully and (2) the error occurred in one of the processes (i.e. the program has ended because of an error).

With the above example, a basic test could simply be a separate file executing this:

import subprocess
import sys

# subprocess.run raises TimeoutExpired if test.py does not stop by itself
subprocess.run([sys.executable, "test.py"], timeout=10)
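To cover point (2) as well, the test can inspect the return code and the captured stderr of the script. The helper below is a sketch under that assumption. The name `assert_fails_fast` is hypothetical, and it assumes the script fails with a `RuntimeError` whose traceback reaches stderr.

```python
import subprocess
import sys


def assert_fails_fast(script, timeout=10):
    """Run `script` and check that it stops by itself with an error."""
    # subprocess.run raises TimeoutExpired if the script hangs: check (1).
    result = subprocess.run(
        [sys.executable, script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # A non-zero return code and a traceback on stderr: check (2).
    assert result.returncode != 0, "script was expected to fail"
    assert "RuntimeError" in result.stderr, result.stderr
    return result
```

A test function would then only need to call `assert_fails_fast("test.py")`.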


Labels: bug
