Storing tables #65

tischi · 2021-10-05T06:48:07Z

An outcome of this hackathon was that we would like to store tabular data in ome-zarr.

I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

What do you think?

imagesc-bot · 2021-10-05T06:48:40Z

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-spatial-omics-hackathon/57337/28

joshmoore · 2021-10-05T07:11:18Z

toCSV will be easy enough. fromCSV will for many (if not most?) cases require extra metadata. There are a number of attempts to provide such metadata, e.g., https://specs.frictionlessdata.io/data-package/
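To illustrate why fromCSV tends to need extra metadata, here is a minimal sketch using only Python's standard library. The column names, the values and the `schema` dictionary are made up for this example. The schema is not the Frictionless format, just a stand-in for per-column type information kept next to the CSV.

```python
import csv
import io

# Hypothetical table: column names and values are illustrative only.
rows = [
    {"label_id": 1, "area": 12.5, "cell_type": "neuron"},
    {"label_id": 2, "area": 7.0, "cell_type": "glia"},
]

def to_csv(rows):
    """Write rows to CSV text; the type information is lost on the way."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def from_csv(text, schema=None):
    """Read CSV text; without a schema every value comes back as a string."""
    out = []
    for record in csv.DictReader(io.StringIO(text)):
        if schema is not None:
            # Convert each value with the type assumed for its column.
            record = {name: schema[name](value) for name, value in record.items()}
        out.append(dict(record))
    return out

text = to_csv(rows)
untyped = from_csv(text)
# Assumed sidecar metadata: one Python type per column.
schema = {"label_id": int, "area": float, "cell_type": str}
typed = from_csv(text, schema)
```

Without the schema, `label_id` comes back as the string "1"; with it, the round trip reproduces the original rows.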

unidesigner · 2021-10-05T08:40:22Z

Just out of curiosity, what is the reason for wanting to store tabular data in zarr, vs. using some existing, optimized data format, like Avro, Parquet, SQLite, etc.?

tischi · 2021-10-05T08:50:23Z

I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

constantinpape · 2021-10-05T09:01:23Z

> I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

I think that compatibility with csv is desirable, but I am not sure how much can be done about this on the spec level.
I see this as more of a software than data standard question.

> Just out of curiosity, what is the reason for wanting to store tabular data in zarr,

I would say the main reason is to provide all relevant data in the same data format and container.
Also note that AnnData, which the proposal is based on, is using zarr as storage already.

tischi · 2021-10-05T09:55:17Z

> Also note that AnnData, which the proposal is based on, is using zarr as storage already.

My worry was that AnnData is much richer than a simple table and thus it may be difficult to map it onto a "simple table"? For example, both in Fiji and Napari there are ways to display a table. Do you think that one could also display AnnData in a "simple table viewer"?

constantinpape · 2021-10-05T10:09:58Z

> My worry was that AnnData is much richer than a simple table

I don't think that AnnData is much richer than a simple table; at least not the subset that we are discussing here. But we have the major advantage that the dtype for each column is known ...

> Do you think that one could also display AnnData in a "simple table viewer"?

Sure. Load X into a 2d array, load obs into a 2d array (this works in Python, where complex dtypes are easy; I don't know how you would do this in Java, but you have the same problem with csv), and concatenate the two column-wise (axis 1 in NumPy terms, so the rows stay aligned). This gives you a simple table. (The only question is what to do about var, but for simplicity it could just be ignored.)
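A rough sketch of that flattening, with plain Python lists standing in for the AnnData pieces. All names and values are invented for illustration; real code would go through the anndata library and arrays rather than nested lists. Here var is reduced to column names for X, which is one way of mostly ignoring it.

```python
# Stand-ins for the AnnData pieces; all names and values are illustrative.
var_names = ["gene_a", "gene_b"]   # one entry per column of X
X = [[0.1, 2.0],                   # n_obs x n_var measurements
     [1.5, 0.0],
     [0.7, 3.2]]
obs = {                            # per-row annotations, one list per column
    "label_id": [1, 2, 3],
    "cell_type": ["neuron", "glia", "neuron"],
}

def to_simple_table(X, var_names, obs):
    """Join obs and X side by side: one header plus one row per observation."""
    n_obs = len(X)
    if any(len(col) != n_obs for col in obs.values()):
        raise ValueError("obs columns must have one entry per row of X")
    header = list(obs) + list(var_names)
    body = [
        [obs[name][i] for name in obs] + list(X[i])
        for i in range(n_obs)
    ]
    return header, body

header, body = to_simple_table(X, var_names, obs)
```

The result is a header plus a list of rows with mixed types per row, which is the shape a simple table viewer or a CSV writer expects.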

unidesigner · 2021-10-05T12:18:13Z

> I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

I found this comparison of zarr and parquet interesting, especially the choice of zarr over parquet for its flexibility and the append option. https://waterdata.usgs.gov/blog/cloud_data/

I don't know all the formats in detail, but I imagine that it's not the first time this requirement comes up, and people have implemented solutions for this.

A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

tischi · 2021-10-05T12:41:29Z

> A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

Good question! Typically one row in the table would correspond to a specific region (e.g. a point) in the image.

One use case is that when someone looks at an image region, one wants to load all the rows that correspond to that region, e.g. in order to render something on the image.

We were thinking that an efficient image-coordinate to table-row mapping could be done by a tree where you enter the coordinate and the leaves of the tree are the table-row indices. However, how to serialize a tree into ome-zarr is something that we have not looked into yet.
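To make the idea concrete, here is a minimal in-memory sketch of such a tree: a 2D point quadtree whose leaves hold table-row indices. The class, its parameters and the sample points are all invented for this example, it assumes the points lie inside the root square, and it says nothing about how the tree would be serialized into ome-zarr.

```python
class QuadTree:
    """Point quadtree over a square region; leaves hold table-row indices."""

    def __init__(self, x0, y0, size, capacity=4, max_depth=16):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.max_depth = max_depth
        self.items = []        # (x, y, row_index) tuples stored in this leaf
        self.children = None   # four sub-quadrants once the leaf is split

    def insert(self, x, y, row):
        if self.children is None:
            self.items.append((x, y, row))
            # max_depth stops endless splitting when many points coincide.
            if len(self.items) > self.capacity and self.max_depth > 0:
                self._split()
            return
        self._child_for(x, y).insert(x, y, row)

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half,
                     self.capacity, self.max_depth - 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        items, self.items = self.items, []
        for x, y, row in items:
            self._child_for(x, y).insert(x, y, row)

    def _child_for(self, x, y):
        half = self.size / 2
        ix = 1 if x >= self.x0 + half else 0
        iy = 1 if y >= self.y0 + half else 0
        return self.children[iy * 2 + ix]

    def query(self, xmin, ymin, xmax, ymax):
        """Return row indices of all points inside the query rectangle."""
        # Skip nodes whose square does not overlap the query rectangle.
        if (xmax < self.x0 or xmin > self.x0 + self.size or
                ymax < self.y0 or ymin > self.y0 + self.size):
            return []
        if self.children is None:
            return [row for x, y, row in self.items
                    if xmin <= x <= xmax and ymin <= y <= ymax]
        rows = []
        for child in self.children:
            rows.extend(child.query(xmin, ymin, xmax, ymax))
        return rows


# Row i of the (hypothetical) table corresponds to points[i].
points = [(1, 1), (2, 8), (9, 9), (8, 2), (5, 5), (1.5, 1.2)]
tree = QuadTree(0, 0, 10, capacity=2)
for row, (x, y) in enumerate(points):
    tree.insert(x, y, row)

# Rows to load for the image region [0, 3] x [0, 3].
visible_rows = sorted(tree.query(0, 0, 3, 3))
```

Only the quadrants overlapping the viewed region are visited, so the lookup does not need to touch every row.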

unidesigner · 2021-10-05T15:16:44Z

An R-Tree is probably what you are looking for. Not sure if this can be serialized in a way where it is not necessary to do a full-table scan to find out about the relevant rows.

The pandas docs have some interesting links as well for out-of-memory data formats and libraries, in particular the ecosystem page. It's not only about fetching data for visualization purposes, but also about efficient compute.

imagesc-bot · 2022-01-26T11:04:38Z

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-generation-user-friendly-smlm-processing-software-aka-thunderstorm-2-0/62289/20

oeway · 2022-04-21T14:21:43Z

I have been testing octree-based spatial partitioning of point clouds (using a library called potree) for the shareloc.xyz platform. It allows us to visualise large point clouds instantly (instead of downloading everything).

Here is a demo for visualising single-molecule localisation microscopy data:
https://imodpasteur.github.io/shareloc-utils/shareloc-potree-viewer.html?pointShape=circle&pointSizeType=adaptive&name=FFB000&load=https://imjoy-s3.pasteur.fr/public/pointclouds/7312e0.zip

When you open and zoom in, more point chunks will be loaded to the browser.

The tree is stored in a zip file and I used HTTP Range requests to obtain the chunks.
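For readers unfamiliar with Range requests, the sketch below shows how one might ask for a byte span of a remote file using Python's standard library. The URL, offset and length are placeholders, not values from the shareloc setup. Finding the right offsets for a member inside a zip would additionally require reading the archive's central directory first, which is not shown.

```python
import urllib.request

def range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={offset}-{last_byte}")
    return request

# Hypothetical URL and byte positions, for illustration only.
req = range_request("https://example.org/pointcloud.zip", offset=1024, length=512)

# Sending it would look like this; a server that honours the header
# answers with status 206 (Partial Content):
# with urllib.request.urlopen(req) as response:
#     chunk = response.read()
```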

As I understand, the tabular support we are discussing here won't allow storing point chunks organized in a tree yet, am I right?

cc @joshmoore

joshmoore · 2022-04-25T08:18:07Z

Also cc: @kevinyamauchi and @ivirshup, who are also discussing this further this week.

I think you are right that there's no tree representation in the current discussions, but perhaps it's more a matter of AND rather than OR. That is, my understanding of the benefit of a tabular layout is the ability to add annotations to the data. How would that work in the tree representation? Does one need both?

For those interested in taking a look, here are some brief details on the contents of @oeway's zip:

unzipped 7312e0.zip

cat sources.json | jq .
{
  "bounds": {
    "min": [
      1600.013671875,
      1633.791748046875,
      0
    ],
    "max": [
      41039.4140625,
      40804.4296875,
      0
    ]
  },
  "projection": "",
  "sources": [
    {
      "name": ".tmp.txt",
      "points": 16898373,
      "bounds": {
        "min": [
          1600.013671875,
          1633.791748046875,
          0
        ],
        "max": [
          41039.4140625,
          40804.4296875,
          0
        ]
      }
    }
  ]
}

cat cloud.js | jq .
{
  "version": "1.7",
  "octreeDir": "data",
  "projection": "",
  "points": 16898373,
  "boundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 41073.192138671875,
    "uz": 39439.400390625
  },
  "tightBoundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 40804.4296875,
    "uz": 0
  },
  "pointAttributes": [
    "POSITION_CARTESIAN",
    "COLOR_PACKED"
  ],
  "spacing": 341.55523681640625,
  "scale": 0.001,
  "hierarchyStepSize": 5
}

tree data/ | head
data/
└── r
    ├── 00060
    │   ├── r00060.bin
    │   └── r00060.hrc
    ├── 00062
    │   ├── r00062.bin
    │   └── r00062.hrc
    ├── 00064
    │   ├── r00064.bin

tree data/ | tail
    ├── r6642.bin
    ├── r6644.bin
    ├── r6646.bin
    ├── r666.bin
    ├── r6660.bin
    ├── r6662.bin
    ├── r6664.bin
    └── r6666.bin

760 directories, 3368 files
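Comparing the two boxes in cloud.js, boundingBox looks like tightBoundingBox grown into a cube whose side is the largest extent (here the x extent), which is what one would expect for an octree root. The check below is only a reading of the numbers above, not something taken from the potree documentation, and the function name is made up.

```python
# Values copied from the tightBoundingBox in cloud.js above.
tight = {"lx": 1600.013671875, "ly": 1633.791748046875, "lz": 0,
         "ux": 41039.4140625, "uy": 40804.4296875, "uz": 0}

def cubic_bounds(box):
    """Grow a box into a cube anchored at its lower corner, side = largest extent."""
    side = max(box["ux"] - box["lx"],
               box["uy"] - box["ly"],
               box["uz"] - box["lz"])
    return {
        "lx": box["lx"], "ly": box["ly"], "lz": box["lz"],
        "ux": box["lx"] + side, "uy": box["ly"] + side, "uz": box["lz"] + side,
    }

cube = cubic_bounds(tight)
```

The resulting `ux`, `uy` and `uz` match the boundingBox entries listed in cloud.js. The z extent of the tight box is 0, i.e. the points are effectively 2D.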

kevinyamauchi · 2022-04-25T08:30:48Z

Hey @oeway ! Nice to see you here. Super cool that you're looking into rendering with spatial partitioning.

Indeed, we are currently focusing on storing points in a table and we are not planning to specify the format for spatial indices (for now). I think there are too many different strategies and the best one is likely application dependent, so I think it doesn't make sense to standardize that. I am definitely open to adding specs for some common spatial indices (e.g., octree, rtree) at some point once we have the basic table spec nailed down.

The pattern I would advocate for is that one queries the spatial index (e.g., octree) to look up the rows to fetch from the table for rendering. The table can be chunked along the rows, so this will allow points to be loaded lazily. Of course, the performance will depend on your chunking and the ordering of your table.
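A small sketch of why chunking and row ordering matter for this pattern: given the row indices returned by a spatial index, it computes which row-chunks would have to be fetched. The chunk size and the row indices are invented for illustration.

```python
def chunks_to_fetch(row_indices, chunk_rows):
    """Return the sorted ids of the row-chunks that contain the requested rows."""
    return sorted({row // chunk_rows for row in row_indices})

# Hypothetical example: a table stored in chunks of 100 rows each.
chunk_rows = 100

# Rows returned by a spatial index for one field of view.
scattered = [3, 250, 407, 655, 910]    # table stored in arbitrary order
clustered = [400, 401, 402, 403, 404]  # table sorted so that neighbours sit together

scattered_chunks = chunks_to_fetch(scattered, chunk_rows)
clustered_chunks = chunks_to_fetch(clustered, chunk_rows)
```

With the arbitrary ordering, five rows touch five different chunks; with the spatially sorted table, the same number of rows comes from a single chunk.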

What do you think @oeway , does this sound reasonable?

oeway · 2022-04-25T09:01:26Z

Hey,

@joshmoore Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

@kevinyamauchi Good to see you here too! Yes, I think it does! It will certainly be useful for the use case I am targeting (i.e. SMLM data). I can also see it being super useful for storing massive scatter plots, e.g. generated from scRNA-seq.

I just did a quick read of your existing PR. In practice, if we do want to support octree (that's the one mostly used for displaying LiDAR sensor data and has been proven to work with even trillions of points for browser-based visualization), would it mean we just add additional tables to var? I would be happy to work with any of you to make a data loader to bridge with the potree viewer (the one I am using now).

ivirshup · 2022-04-26T12:40:45Z

> Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

I think it might make sense to consider spatial indexing as a property of the coordinate array. Especially if you have the same points represented in different coordinate spaces (e.g. slide by itself, slide aligned in stack).

tischi closed this as completed Oct 5, 2021

tischi reopened this Oct 5, 2021

thewtex mentioned this issue Jan 31, 2022

Finalize axes & initial transformation #85

Merged

bogovicj mentioned this issue Feb 18, 2022

Transformation types #101

Open

thewtex mentioned this issue Apr 12, 2022

Zarr pyramid for massive point cloud Kitware/itk-vtk-viewer#365

Open
