I would like to disambiguate two potential meanings of "parallel writing" with MPI. Here is an example scenario. Can you let me know which behavior to expect from h5py + MPI?
Scenario: A pre-existing H5 file with 50 datasets exists, each with space preallocated. There are 50 different MPI processes running, and at the exact same instant, each attempts to write a numpy array to one of the 50 datasets.
In this scenario, which of the following would occur?
1. All 50 writes happen simultaneously, resulting in an approximately 50x speedup for writing the data to the file.
2. Even though the 50 writes are initiated by distinct processes, they still happen sequentially in time, resulting in no appreciable speedup for writing the data to the file.
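For concreteness, here is a minimal sketch of the pattern I mean (this assumes a parallel-enabled h5py build; the file name, dataset names, and array shape are made up for illustration):

```python
# Run with e.g. `mpiexec -n 50 python write_datasets.py`.
# Assumes "output.h5" already contains 50 preallocated datasets
# named "data_00" .. "data_49", each of shape (1000, 1000).
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank opens the same file collectively with the MPI-IO driver.
with h5py.File("output.h5", "r+", driver="mpio", comm=comm) as f:
    data = np.random.rand(1000, 1000)       # the in-memory numpy array
    f[f"data_{rank:02d}"][...] = data       # each rank writes its own dataset
```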
The reason I ask is that I have an ingestion script that uses h5py+MPI to write a number of datasets into a single H5 file. Running with 1 process or 75 processes produces identical output. However, the 75-process version is only about 3x faster than the 1-process version, and the single line that writes the in-memory numpy array to the precreated, presized dataset in the single output H5 file accounts for virtually the entire runtime.
This makes me suspect that under the hood, even though h5py+MPI can collect "requests" to write to datasets from multiple independent processes in parallel, each individual write to the H5 file still happens sequentially in reality. So if N parallel processes are simultaneously trying to do their writes, they won't realize an Nx speedup if the bottleneck is the write to the H5 file itself.
Can you let me know if I'm understanding h5py+MPI's parallel write correctly?
Thank you.
I believe that with MPI you can genuinely do the writes in parallel in the sense that HDF5 doesn't need to serialise the operations of different processes. But of course there are still limitations in the operating system and the hardware, so it's unlikely that using 75 processes is ever going to be 75x faster.
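As a hedged illustration (the file name, dataset name, and shapes here are hypothetical, and this assumes h5py was built against a parallel HDF5), collective I/O is one way to let the MPI-IO layer coordinate the ranks' writes rather than issuing them independently:

```python
# Run with e.g. `mpiexec -n 4 python collective_write.py`.
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("collective.h5", "w", driver="mpio", comm=comm) as f:
    # Dataset creation modifies file metadata, so all ranks must call it.
    dset = f.create_dataset("data", (size, 1000), dtype="f8")
    # In collective mode, every rank participates in the write call,
    # which lets the MPI-IO layer aggregate the operations.
    with dset.collective:
        dset[rank, :] = np.full(1000, rank, dtype="f8")
```

Whether collective mode actually helps depends on the MPI implementation and the filesystem; on a single local disk, the hardware is usually the bottleneck regardless of how many processes are writing.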