[AL-2078] Switch random split doc section (activeloopai#2064)
* fix

* changed docstring

* update

* fix

* Update deeplake/core/dataset/dataset.py

* add indra import in random split

* fix

Co-authored-by: Fayaz Rahman <fayazrahman4u@gmail.com>
AbhinavTuli and FayazRahman authored Dec 21, 2022
1 parent 51c8a1b commit 384b074
Showing 5 changed files with 23 additions and 1 deletion.
2 changes: 2 additions & 0 deletions deeplake/api/tests/test_api.py
@@ -10,6 +10,7 @@
assert_array_lists_equal,
is_opt_true,
get_dummy_data_path,
requires_libdeeplake,
)
from deeplake.tests.storage_fixtures import enabled_remote_storages
from deeplake.core.storage import GCSProvider
@@ -2252,6 +2253,7 @@ def test_iter_warning(local_ds):
ds.abc[10]


@requires_libdeeplake
def test_random_split(local_ds):
with local_ds as ds:
ds.create_tensor("label")
3 changes: 2 additions & 1 deletion deeplake/core/dataset/dataset.py
@@ -3930,8 +3930,9 @@ def max_view(self):
)

def random_split(self, lengths: Sequence[Union[int, float]]):
- """Splits the dataset into non-overlapping new datasets of given lengths.
+ """Splits the dataset into non-overlapping :class:`~deeplake.core.dataset.Dataset` objects of given lengths.
  If a list of fractions that sum up to 1 is given, the lengths will be computed automatically as floor(frac * len(dataset)) for each fraction provided.
+ The split generated is only performant with enterprise dataloader which can be installed with ``pip install deeplake[enterprise]``.
After computing the lengths, if there are any remainders, 1 count will be distributed in round-robin fashion to the lengths until there are no remainders left.
3 changes: 3 additions & 0 deletions deeplake/util/scheduling.py
@@ -60,6 +60,9 @@ def create_fetching_schedule(dataset, primary_tensor_name, shuffle_within_chunks


def create_random_split_views(dataset, lengths):
from deeplake.enterprise.convert_to_libdeeplake import import_indra_api

import_indra_api()
if math.isclose(sum(lengths), 1) and sum(lengths) <= 1:
subset_lengths: List[int] = []
for i, frac in enumerate(lengths):
15 changes: 15 additions & 0 deletions docs/source/Random-Split.rst
@@ -0,0 +1,15 @@
.. currentmodule:: deeplake.core.dataset

Random Split
============

Splits the dataset into non-overlapping new datasets of given lengths.
The resulting datasets are generated so that creating a dataloader from a view and training on it
has minimal performance impact. Using the outputs of this function with the dataset's ``.pytorch`` method (instead of ``.dataloader``) may result in poor performance.
See the ``random_split`` method for how to use this feature:

.. autosummary::
:toctree:
:nosignatures:

Dataset.random_split
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -39,6 +39,7 @@ Deep Lake is an open-source database for AI.
Dataloader <Dataloader>
Sampler <Sampler>
Tensor Query Language <Tensor-Query-Language>
Random Split <Random-Split>

.. toctree::
:maxdepth: 1
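
Note on the length computation: the `random_split` docstring in this commit describes taking floor(frac * len(dataset)) for each fraction and then handing out any leftover samples one at a time in round-robin order. The following is a minimal standalone sketch of that rule, not the Deep Lake implementation. The helper name `compute_split_lengths` and the `total` parameter are hypothetical, and passing integer lengths through unchanged is an assumption, since that branch is not shown in the diff.

```python
import math
from typing import List, Sequence, Union


def compute_split_lengths(
    total: int, lengths: Sequence[Union[int, float]]
) -> List[int]:
    """Hypothetical helper: convert split fractions into integer split sizes."""
    # Same guard as the diff: treat the input as fractions only if it sums to 1.
    if math.isclose(sum(lengths), 1) and sum(lengths) <= 1:
        # Step 1: floor(frac * total) for every fraction.
        sizes = [int(math.floor(total * frac)) for frac in lengths]
        # Step 2: distribute the leftover samples round-robin, one per split.
        remainder = total - sum(sizes)
        for i in range(remainder):
            sizes[i % len(sizes)] += 1
        return sizes
    # Assumption: absolute lengths are used as given.
    return [int(n) for n in lengths]


print(compute_split_lengths(10, [0.5, 0.25, 0.25]))  # [6, 2, 2]
```

With 10 samples and fractions 0.5, 0.25, 0.25, the floors are 5, 2 and 2, which leaves one sample over; the round-robin step gives it to the first split.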