
Put zarr, tileDB and hub benchmarks in one file #534

Merged

Conversation

DebadityaPal
Contributor

A preliminary attempt at standardizing the benchmarks, along with some optimizations. Linked to #529.

@github-actions

github-actions bot commented Feb 4, 2021

Locust summary

Git references

Initial: 0128704
Terminal: d76b956

benchmarks/benchmark_tiledb_zarr_hub.py
Changes:
  • Name: time_tiledb
    Type: function
    Changed lines: 33
    Total lines: 33
  • Name: time_zarr
    Type: function
    Changed lines: 44
    Total lines: 44
  • Name: time_hub
    Type: function
    Changed lines: 20
    Total lines: 20

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

Merging #534 (f90f962) into master (4d40553) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #534   +/-   ##
=======================================
  Coverage   88.46%   88.46%           
=======================================
  Files          52       52           
  Lines        3745     3745           
=======================================
  Hits         3313     3313           
  Misses        432      432           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d40553...f90f962.

@haiyangdeperci
Contributor

Good job! Could you rename the file to benchmark_sequential_access.py?

@DebadityaPal
Contributor Author

Yeah, I changed it to the more appropriate name.

@haiyangdeperci
Contributor

@DebadityaPal Great! I'll review it soon.

@mynameisvinn
Contributor

mynameisvinn commented Feb 4, 2021

@DebadityaPal I'll defer to @haiyangdeperci, but I see duplicate code that can be factored out.

For example:

with Timer("Time"):
    ...
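The `Timer("Time")` snippet above implies a small timing context manager shared by all three benchmark functions. A minimal sketch of such a helper, assuming the repository's actual `Timer` implementation may differ:

```python
import time


class Timer:
    """Context manager that reports how long the enclosed block took.

    A sketch of the Timer helper referenced above; the real one in the
    Hub repository may print differently or record results elsewhere.
    """

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.label}: {self.elapsed:.4f}s")


# Reusing one such helper across time_tiledb, time_zarr and time_hub
# is what removes the duplicated timing boilerplate.
with Timer("Time"):
    sum(range(1000))
```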

@DebadityaPal
Copy link
Contributor Author

@mynameisvinn I have removed the duplicate code. Thanks for pointing it out.

@mynameisvinn
Contributor

@DebadityaPal Much better! There is nothing better than concise, well-organized code :-)

@haiyangdeperci
Contributor

haiyangdeperci left a comment


Is it absolutely necessary to put the entire dataset in memory, as below?

ds["image"].compute().reshape(ds.shape[0], -1),

The machine we are using for benchmarks has 160 GB of memory, while this step requires 330 GiB:

MemoryError: Unable to allocate 330. GiB for an array with shape (1803460, 256, 256, 3) and data type uint8

I've noted some other inefficiencies and I am working on them as well, but this one is a blocker.
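One way to avoid materializing the full array is to read and reshape one batch at a time, so only a single batch is ever resident in memory. A sketch with a small NumPy array standing in for the dataset (the real `ds["image"]` is a lazily evaluated Hub tensor, and the batch boundaries here are illustrative):

```python
import numpy as np

# Stand-in for the dataset: the real code materializes a lazy Hub
# tensor via ds["image"].compute(); a tiny array plays that role here.
images = np.zeros((10, 4, 4, 3), dtype=np.uint8)

batch_size = 4
for start in range(0, images.shape[0], batch_size):
    # Only one batch lives in memory at a time, instead of the whole
    # (1803460, 256, 256, 3) array that triggered the MemoryError.
    batch = images[start : start + batch_size]
    flat = batch.reshape(batch.shape[0], -1)  # (batch, H*W*C)
    # ... feed `flat` to the benchmark body ...
```

The same per-batch pattern applies to each backend's read loop, keeping peak memory proportional to `batch_size` rather than to the dataset.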

@haiyangdeperci
Contributor

@DebadityaPal I am still having some memory issues after the patch. For instance:

Traceback (most recent call last):
  File "Hub/benchmarks/benchmark_sequential_access.py", line 165, in <module>
    time_tiledb(dataset, batch_size, split)
  File "Hub/benchmarks/benchmark_sequential_access.py", line 90, in time_tiledb
    ds_tldb[batch * batch_size : (batch + 1) * batch_size] = ds_numpy
  File "tiledb/libtiledb.pyx", line 4909, in tiledb.libtiledb.DenseArrayImpl.__setitem__
  File "tiledb/libtiledb.pyx", line 416, in tiledb.libtiledb._write_array
  File "tiledb/libtiledb.pyx", line 480, in tiledb.libtiledb._raise_ctx_err
  File "tiledb/libtiledb.pyx", line 465, in tiledb.libtiledb._raise_tiledb_error
tiledb.libtiledb.TileDBError: [TileDB::ChunkedBuffer] Error: Chunk write error; malloc() failed
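The failing line assigns a whole batch into the TileDB array in one call, so the write-side allocation scales with `batch_size`. One plausible mitigation (not necessarily what the eventual patch did) is to split each large slice assignment into smaller writes, bounding the per-call allocation. A sketch using a NumPy array as a stand-in for the `tiledb.DenseArray`, with `write_chunk` as a hypothetical tuning knob:

```python
import numpy as np

# Stand-ins: the real code assigns into a tiledb.DenseArray opened in
# write mode; a NumPy array mimics the slice-assignment interface.
ds_tldb = np.empty((100, 8), dtype=np.uint8)
ds_numpy = np.ones((40, 8), dtype=np.uint8)

batch, batch_size = 0, 40
write_chunk = 10  # smaller than batch_size, so each write allocates less

offset = batch * batch_size
for i in range(0, ds_numpy.shape[0], write_chunk):
    # Each assignment touches only write_chunk rows at a time, instead
    # of the full batch whose buffer allocation made malloc() fail.
    ds_tldb[offset + i : offset + i + write_chunk] = ds_numpy[i : i + write_chunk]
```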
        

@DebadityaPal
Contributor Author

The current changes resolve the memory crash issue. The whole run should now take less than 15 GB of RAM.

@haiyangdeperci
Contributor

@DebadityaPal Great! Thanks for the update. I'll check it out right away.

@davidbuniat davidbuniat merged commit 11a7cb3 into activeloopai:master Feb 21, 2021
Labels
enhancement New feature or request
4 participants