Storing tables #65

Open
tischi opened this issue Oct 5, 2021 · 16 comments

Comments

@tischi

tischi commented Oct 5, 2021

An outcome of this hackathon was that we would like to store tabular data in ome-zarr.

I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

What do you think?

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-spatial-omics-hackathon/57337/28

@joshmoore
Member

joshmoore commented Oct 5, 2021

toCSV will be easy enough. fromCSV will for many (if not most?) cases require extra metadata. There are a number of attempts to provide such metadata, e.g., https://specs.frictionlessdata.io/data-package/
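A minimal sketch of why the two directions are not symmetric, using only the Python standard library. The column names, the dtypes and the JSON sidecar layout are assumptions made up for illustration; they are not part of any proposal in this thread, nor the Frictionless format itself.

```python
import csv
import io
import json

# Hypothetical table: column names and dtypes are illustrative only.
columns = {"label_id": "int", "area": "float", "cell_type": "str"}
rows = [[1, 12.5, "neuron"], [2, 7.0, "glia"]]


def to_csv(columns, rows):
    """Write a header plus rows; the dtype information is dropped."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns.keys())
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text, schema=None):
    """Read rows back; without a schema every value stays a string."""
    casts = {"int": int, "float": float, "str": str}
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    out = []
    for record in reader:
        if schema is None:
            out.append(record)
        else:
            out.append([casts[schema[name]](value)
                        for name, value in zip(header, record)])
    return header, out


text = to_csv(columns, rows)
_, untyped = from_csv(text)                      # all strings
sidecar = json.dumps(columns)                    # the "extra metadata"
_, typed = from_csv(text, json.loads(sidecar))   # dtypes restored
```

Without the sidecar, a reader has to guess whether `"1"` is an integer label, a float or a string, which is the gap that metadata schemes such as the one linked above try to close.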

@unidesigner

Just out of curiosity, what is the reason for wanting to store tabular data in zarr vs. using some existing, optimized data format such as Avro, Parquet, SQLite, etc.?

@tischi
Author

tischi commented Oct 5, 2021

I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

@constantinpape
Contributor

I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

I think that compatibility with csv is desirable, but I am not sure how much can be done about this on the spec level.
I see this as more of a software than data standard question.

Just out of curiosity, what is the reason for wanting to store tabular data in zarr,

I would say the main reason is to provide all relevant data in the same data format and container.
Also note that AnnData, which the proposal is based on, is using zarr as storage already.

@tischi
Author

tischi commented Oct 5, 2021

Also note that AnnData, which the proposal is based on, is using zarr as storage already.

My worry was that AnnData is much richer than a simple table and thus it may be difficult to map it onto a "simple table"? For example, both in Fiji and Napari there are ways to display a table. Do you think that one could also display AnnData in a "simple table viewer"?

@tischi tischi closed this as completed Oct 5, 2021
@tischi tischi reopened this Oct 5, 2021
@constantinpape
Contributor

constantinpape commented Oct 5, 2021

My worry was that AnnData is much richer than a simple table

I don't think that AnnData is much richer than a simple table; at least not the subset that we are discussing here. But we have the major advantage that the dtype for each column is known ...

Do you think that one could also display AnnData in a "simple table viewer"?

Sure. Load X into a 2d array, load obs into a 2d array (this works in python where complex dtypes are easy, I don't know how you would do this in java, but you have the same problem in csv), concatenate along the first axis (=columns). This gives you a simple table. (The only question is what to do about var, but for simplicity it could just be ignored).
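The recipe above can be sketched without any AnnData dependency. The lists and dict below are hypothetical stand-ins for `X`, `obs` and the `var` names (all values are invented); a real implementation would read these arrays from the zarr groups instead.

```python
# Hypothetical stand-ins for the AnnData fields discussed above.
var_names = ["gene_a", "gene_b"]            # labels for the columns of X
X = [[0.0, 3.0], [1.5, 0.0], [2.0, 4.0]]    # observations x variables
obs = {                                     # per-observation annotations
    "label_id": [1, 2, 3],
    "cell_type": ["neuron", "glia", "neuron"],
}


def to_simple_table(X, var_names, obs):
    """Join X and obs column-wise into one header plus a list of rows."""
    n_rows = len(X)
    for name, column in obs.items():
        if len(column) != n_rows:
            raise ValueError(f"obs column {name!r} does not match X")
    header = list(var_names) + list(obs)
    rows = [list(X[i]) + [obs[name][i] for name in obs]
            for i in range(n_rows)]
    return header, rows


header, rows = to_simple_table(X, var_names, obs)
```

Each row then mixes floats, integers and strings, which is the "complex dtype" situation mentioned above; `var` is only used here to label the columns of `X`.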

@unidesigner

I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

I found this comparison of zarr and parquet interesting, especially the choice of zarr over parquet for flexibility and the append option. https://waterdata.usgs.gov/blog/cloud_data/

I don't know all the formats in detail, but I imagine that it's not the first time this requirement comes up, and people have implemented solutions for this.

A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

@tischi
Author

tischi commented Oct 5, 2021

A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

Good question! Typically one row in the table would correspond to a specific region (e.g. a point) in the image.

One use-case is that if people look at an image region one wants to load all the rows that correspond to this image region, e.g. in order to render something on the image.

We were thinking that an efficient image-coordinate to table-row mapping could be done by a tree where you enter the coordinate and the leaves of the tree are the table-row indices. However, how to serialize a tree into ome-zarr is something that we did not look into yet....
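As an illustration of such a coordinate-to-row mapping, here is a small in-memory 2D quadtree whose leaves hold table-row indices. It is only a sketch of the idea: the class name, the `capacity` and `max_depth` parameters and the example points are assumptions, and it does not address the open question of serializing the tree into ome-zarr.

```python
class QuadTree:
    """Point quadtree; leaves store (x, y, row) with row = table-row index."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth  # guards against endless splitting
        self.points = []            # filled only while this node is a leaf
        self.children = None        # four sub-quadrants after a split

    def insert(self, x, y, row):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is not None:
            # any() stops at the first child that accepts the point
            return any(c.insert(x, y, row) for c in self.children)
        self.points.append((x, y, row))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()
        return True

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [
            QuadTree(*q, self.capacity, self.depth + 1, self.max_depth)
            for q in quadrants
        ]
        points, self.points = self.points, []
        for point in points:
            self.insert(*point)

    def query(self, qx0, qy0, qx1, qy1):
        """Return row indices of all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []               # no overlap: skip this whole subtree
        if self.children is None:
            return [row for x, y, row in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        return [row for c in self.children
                for row in c.query(qx0, qy0, qx1, qy1)]


# Hypothetical table: one point per row on a 10 x 10 grid.
table = [(float(i), float(j)) for i in range(10) for j in range(10)]
tree = QuadTree(0.0, 0.0, 9.0, 9.0)
for row, (x, y) in enumerate(table):
    tree.insert(x, y, row)
rows_in_view = sorted(tree.query(2.0, 2.0, 4.0, 4.0))
```

The query only descends into quadrants that overlap the viewed region, so the rows to load are found without scanning the whole table.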

@unidesigner

An R-Tree is probably what you are looking for. Not sure if this can be serialized in a way where it is not necessary to do a full-table scan to find out about the relevant rows.

The pandas docs have some interesting links as well for out-of-memory data formats and libraries, in particular the ecosystem page. It's not only about fetching data for visualization purposes, but also about efficient compute.

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-generation-user-friendly-smlm-processing-software-aka-thunderstorm-2-0/62289/20

@oeway

oeway commented Apr 21, 2022

I have been testing octree-based spatial partitioning of point clouds (using a library called potree) for the shareloc.xyz platform. It allows us to visualise large point clouds instantly (instead of downloading everything).

Here is a demo for visualising single-molecule localisation microscopy data:
https://imodpasteur.github.io/shareloc-utils/shareloc-potree-viewer.html?pointShape=circle&pointSizeType=adaptive&name=FFB000&load=https://imjoy-s3.pasteur.fr/public/pointclouds/7312e0.zip

When you open and zoom in, more point chunks will be loaded to the browser.

The tree is stored in a zip file and I use HTTP Range requests to obtain the chunks.
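A rough sketch of how a byte range for one zip member could be derived, assuming the members are stored uncompressed (the thread does not say whether they are). The archive is built in memory with invented contents, and `member_byte_range` is a hypothetical helper, not part of shareloc-utils or potree.

```python
import io
import struct
import zipfile


def member_byte_range(archive, name):
    """Return (first, last) byte offsets of a stored member's raw data.

    The whole archive is in memory here for the demo; a remote client
    would instead range-read the central directory at the end of the file.
    """
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; raw bytes are not the data")
    # Local file header: 30 fixed bytes, then the file name and extra field,
    # whose lengths are stored at byte offsets 26 and 28 of the header.
    name_len, extra_len = struct.unpack_from(
        "<HH", archive, info.header_offset + 26)
    first = info.header_offset + 30 + name_len + extra_len
    return first, first + info.file_size - 1


# Build a small stand-in archive (file names mimic the listing style).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("cloud.js", b'{"version": "1.7"}')
    zf.writestr("data/r/r6.bin", b"POINTS-OF-NODE-r6")
archive = buf.getvalue()

first, last = member_byte_range(archive, "data/r/r6.bin")
# HTTP byte ranges are inclusive on both ends.
headers = {"Range": f"bytes={first}-{last}"}
chunk = archive[first:last + 1]   # what a server would return for that range
```

Sending `headers` with a GET request to the zip URL would then fetch only that node's bytes, provided the server honours Range requests.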

As I understand, the tabular support we are discussing here won't allow storing point chunks organized in a tree yet, am I right?

cc @joshmoore

@joshmoore
Member

Also cc: @kevinyamauchi and @ivirshup, who are also discussing this further this week.

I think you are right that there's no tree representation in the current discussions, but perhaps it's more a matter of AND rather than OR. That is, my understanding of the benefit of tabular layout is the ability to add annotations to the data. How would that work in the tree representation? Does one need both?

For those interested in taking a look, here are some brief details on the contents of @oeway's zip:

unzipped 7312e0.zip
cat sources.json | jq .
{
  "bounds": {
    "min": [
      1600.013671875,
      1633.791748046875,
      0
    ],
    "max": [
      41039.4140625,
      40804.4296875,
      0
    ]
  },
  "projection": "",
  "sources": [
    {
      "name": ".tmp.txt",
      "points": 16898373,
      "bounds": {
        "min": [
          1600.013671875,
          1633.791748046875,
          0
        ],
        "max": [
          41039.4140625,
          40804.4296875,
          0
        ]
      }
    }
  ]
}

cat cloud.js | jq .
{
  "version": "1.7",
  "octreeDir": "data",
  "projection": "",
  "points": 16898373,
  "boundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 41073.192138671875,
    "uz": 39439.400390625
  },
  "tightBoundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 40804.4296875,
    "uz": 0
  },
  "pointAttributes": [
    "POSITION_CARTESIAN",
    "COLOR_PACKED"
  ],
  "spacing": 341.55523681640625,
  "scale": 0.001,
  "hierarchyStepSize": 5
}

tree data/ | head
data/
└── r
    ├── 00060
    │   ├── r00060.bin
    │   └── r00060.hrc
    ├── 00062
    │   ├── r00062.bin
    │   └── r00062.hrc
    ├── 00064
    │   ├── r00064.bin

tree data/ | tail
    ├── r6642.bin
    ├── r6644.bin
    ├── r6646.bin
    ├── r666.bin
    ├── r6660.bin
    ├── r6662.bin
    ├── r6664.bin
    └── r6666.bin

760 directories, 3368 files
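
Reading the listing together with `"hierarchyStepSize": 5`, the file layout appears to follow a simple rule: each node name is `r` plus a string of child indices, and every complete group of 5 indices becomes a directory level. The sketch below encodes that reading. It is inferred from the output above rather than confirmed in this thread, and `node_file_path` is a hypothetical helper.

```python
def node_file_path(node_name, octree_dir="data", step=5, ext=".bin"):
    """Map an octree node name such as 'r00060' to its file path.

    Assumption: every complete group of `step` child indices forms one
    directory level below '<octree_dir>/r'.
    """
    if not node_name.startswith("r"):
        raise ValueError("node names are expected to start with 'r'")
    indices = node_name[1:]                  # child indices below the root
    n_groups = len(indices) // step          # complete groups only
    groups = [indices[i * step:(i + 1) * step] for i in range(n_groups)]
    return "/".join([octree_dir, "r", *groups, node_name + ext])


shallow = node_file_path("r6642")     # fewer than 5 indices: no subdirectory
deeper = node_file_path("r00060")     # 5 indices: one directory level
```

Both results match paths shown in the `tree` output. The listed names also only use the indices 0, 2, 4 and 6, which would fit a flat (z = 0) data set occupying half of the octants.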

@kevinyamauchi
Contributor

kevinyamauchi commented Apr 25, 2022

Hey @oeway ! Nice to see you here. Super cool that you're looking into rendering with spatial partitioning.

Indeed, we are currently focusing on storing points in a table and we are not planning to specify the format for spatial indices (for now). I think there are too many different strategies and the best one is likely application dependent, so I think it doesn't make sense to standardize that. I am definitely open to adding specs for some common spatial indices (e.g., octree, rtree) at some point once we have the basic table spec nailed down.

The pattern I would advocate for is that one queries the spatial index (e.g., octree) to look up the rows to fetch from the table for rendering. The table can be chunked along the rows, so this will allow points to be loaded lazily. Of course, the performance will depend on your chunking and the ordering of your table.
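
To make the dependence on chunking and ordering concrete, here is a small sketch that counts which row chunks a query touches. The 8 x 8 grid of points, the chunk size of 4 rows and the use of a Morton (Z-order) sort are illustrative assumptions; no particular ordering has been proposed in this thread.

```python
def chunks_to_fetch(row_indices, chunk_rows):
    """Return the sorted ids of the row chunks that contain the given rows."""
    return sorted({row // chunk_rows for row in row_indices})


def morton(x, y, bits=3):
    """Interleave the bits of x and y so nearby points get nearby codes."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code


# Rows returned by a (hypothetical) spatial index for the region
# x in {2, 3}, y in {2, 3} of an 8 x 8 grid with one point per pixel.
region = [(x, y) for y in (2, 3) for x in (2, 3)]

# Table ordered row-major: the row index is y * 8 + x.
row_major_rows = [y * 8 + x for x, y in region]
# Table ordered by Morton code: the row index is the code itself.
morton_rows = [morton(x, y) for x, y in region]

row_major_chunks = chunks_to_fetch(row_major_rows, chunk_rows=4)
morton_chunks = chunks_to_fetch(morton_rows, chunk_rows=4)
```

In this toy case the same four points sit in two chunks of the row-major table but in a single chunk of the spatially sorted one, so the second layout needs fewer requests.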

What do you think @oeway , does this sound reasonable?

@oeway

oeway commented Apr 25, 2022

Hey,

@joshmoore Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

@kevinyamauchi Good to see you here too! Yes, I think it does! It will certainly be useful for the use case I am targeting (i.e. SMLM data), and I can also see it being super useful for storing massive scatter plots, e.g. those generated from scRNA-seq.

I just had a quick read of your existing PR. In practice, if we do want to support octrees (the structure most commonly used for displaying LiDAR sensor data, which has been proven to work with even trillions of points for browser-based visualization), would it mean we just add additional tables to var? I would be happy to work with any of you to make a data loader to bridge with the potree viewer (the one I am using now).

@ivirshup
Contributor

Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

I think it might make sense to consider spatial indexing as a property of the coordinate array. Especially if you have the same points represented in different coordinate spaces (e.g. slide by itself, slide aligned in stack).


8 participants