Description
We have been internally evaluating the use of SNIRF as a native output format for Gowerlabs' Lumo system.
Lumo is a high density system, and our full head adult caps contain 54 modules, each with 3 dual-wavelength sources and 4 detectors. We are able to provide a dense output, which results in (54 x 4 x 54 x 6 = ) circa 70k channels.
The use of an HDF5 group per channel descriptor (e.g. /data1/measurementList{i}) appears to incur significant overhead. For example, a SNIRF file containing only metadata (no channel data) for a full head system amounts to ~200MiB, or ~3KiB per channel. The actual information content of each descriptor (containing only the required fields plus module indices) amounts to only (7 x 4 = ) 28 bytes, so this is an overhead of approximately 99%.
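For anyone who wants to check the figures above, here is the arithmetic as a short Python sketch. The constant names are mine, and the 200MiB figure is the approximate measured file size quoted above, so the derived values are estimates only.

```python
# Channel count for a full head Lumo cap, using the figures quoted above.
MODULES = 54
SOURCES_PER_MODULE = 3
WAVELENGTHS_PER_SOURCE = 2
DETECTORS_PER_MODULE = 4

detectors = MODULES * DETECTORS_PER_MODULE
source_wavelengths = MODULES * SOURCES_PER_MODULE * WAVELENGTHS_PER_SOURCE
channels = detectors * source_wavelengths  # dense: every detector sees every source

# Approximate size of a metadata-only SNIRF file (measured, ~200MiB).
metadata_bytes = 200 * 1024**2
bytes_per_channel = metadata_bytes / channels

# Information content: 7 fields (required fields plus module indices), 4 bytes each.
payload_bytes = 7 * 4
overhead_fraction = 1 - payload_bytes / bytes_per_channel

print(f"channels:           {channels}")
print(f"bytes per channel:  {bytes_per_channel:.0f}")
print(f"overhead fraction:  {overhead_fraction:.3f}")
```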
Our results appear broadly consistent with this analysis:
The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.
Evidently the size of the metadata grows linearly with the number of channels, as does the data rate of the channel time series, and hence for longer recordings the metadata becomes a proportionally smaller part of the file. However, in absolute terms we find that (with appropriate chunking and online compression) the metadata is equivalent in size to around four minutes of compressed raw channel data. Given the length of a typical measurement session, the overhead remains significant.
I appreciate that the majority of systems (such as those of the manufacturers listed on the SNIRF specification page) are of a much lower density than Lumo, and that even high density systems often produce sparse data, but evidently the trend is towards increasing density and the number of wavelengths. Our future products would, based on the current SNIRF specification, generate over 0.5GiB of metadata.
- Have you previously considered this?
- Might it be possible to use an array of a compound datatype to represent channel descriptors?
- Do you have any alternative suggestions as to how we might reduce this overhead?
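To make the compound datatype question concrete, here is a minimal sketch using h5py and NumPy. It assumes one 32-bit integer per field; the dataset name `measurementListCompound` and the helper `write_descriptors` are hypothetical and not part of the SNIRF specification. The field names follow the existing per-channel descriptor fields.

```python
import h5py
import numpy as np

# One record per channel: the required fields plus module indices,
# each assumed to be a 32-bit integer (7 x 4 = 28 bytes).
channel_dtype = np.dtype([
    ("sourceIndex", "<i4"),
    ("detectorIndex", "<i4"),
    ("wavelengthIndex", "<i4"),
    ("dataType", "<i4"),
    ("dataTypeIndex", "<i4"),
    ("sourceModuleIndex", "<i4"),
    ("detectorModuleIndex", "<i4"),
])


def write_descriptors(n_channels):
    """Store all channel descriptors as a single compound-typed dataset.

    Hypothetical layout: the dataset name below is an illustration only.
    Returns (bytes per record, dataset shape, raw table size in bytes).
    """
    table = np.zeros(n_channels, dtype=channel_dtype)
    # In-memory HDF5 file, so nothing is written to disk.
    with h5py.File("sketch.snirf", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("/data1/measurementListCompound", data=table)
        return dset.dtype.itemsize, dset.shape, table.nbytes


itemsize, shape, raw_bytes = write_descriptors(54 * 4 * 54 * 6)
print(itemsize, shape, raw_bytes)
```

With this layout the descriptor payload for a full head cap is a single dataset of under 2MiB before compression, rather than one group per channel.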