Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AL-2038] Faster transforms #2094

Merged
merged 51 commits into from
Apr 20, 2023
Merged

[AL-2038] Faster transforms #2094

merged 51 commits into from
Apr 20, 2023

Conversation

FayazRahman
Copy link
Contributor

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

@FayazRahman FayazRahman marked this pull request as ready for review January 5, 2023 17:00
@codecov
Copy link

codecov bot commented Jan 6, 2023

Codecov Report

Patch coverage: 96.15% and project coverage change: +0.34 🎉

Comparison is base (1536f94) 88.15% compared to head (3be8d22) 88.49%.

❗ Current head 3be8d22 differs from pull request most recent head 357d0dd. Consider uploading reports for the commit 357d0dd to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2094      +/-   ##
==========================================
+ Coverage   88.15%   88.49%   +0.34%     
==========================================
  Files         291      291              
  Lines       32632    32853     +221     
==========================================
+ Hits        28767    29074     +307     
+ Misses       3865     3779      -86     
Flag Coverage Δ
unittests 88.49% <96.15%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
deeplake/core/polygon.py 89.10% <66.66%> (-0.37%) ⬇️
deeplake/core/transform/transform_tensor.py 91.95% <90.90%> (+13.04%) ⬆️
deeplake/util/transform.py 94.65% <97.10%> (+1.36%) ⬆️
deeplake/core/chunk_engine.py 94.27% <100.00%> (+0.59%) ⬆️
deeplake/core/linked_chunk_engine.py 91.70% <100.00%> (ø)
deeplake/core/tensor_link.py 91.15% <100.00%> (+4.35%) ⬆️
deeplake/core/transform/test_transform.py 99.36% <100.00%> (+0.04%) ⬆️
deeplake/core/transform/transform.py 98.52% <100.00%> (-0.02%) ⬇️
deeplake/core/transform/transform_dataset.py 98.64% <100.00%> (+5.31%) ⬆️
deeplake/util/class_label.py 98.83% <100.00%> (+0.01%) ⬆️
... and 1 more

... and 128 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

deeplake/core/transform/transform_dataset.py Show resolved Hide resolved
pg_callback,
ignore_errors,
):
n = len(data_slice)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Either create several methods or a class. One solution could be to initialize an object replace each if statement with a method. To be honest I prefer more creating object for this method.
  2. Nested try catch block seems complicated, can you put some comments and elaborate more on why we need a nested try catch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function only calls other functions and does almost nothing else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't agree with it, you are doing bunch of checks and try catches, it is too much for a single function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also pls add comments to try catch blocks, so that we will understand what it tries to catch

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are very few checks and try catch blocks are there for a reason. This functionn also serves only one person. please try to elaborate more what you mean.

elif isinstance(value, Sample):
value = value.array
return value
values: List[Any] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit complicated logic, can we separate it?

For instance we can keep for loop, create a dict:

INSTANCE_TYPE_TO_APPEND_METHOD = {
    Sample: self.append_sample,
    LinkedSample: self.append_tensor_or_partial_sample,
    Tensor: self.append_tensor_or_partial_sample,
    type(None): self.append_tensor_or_partial_sample,
    PartialSample:  self.append_tensor_or_partial_sample,
    LinkedTiledSample: self.append_tensor_or_partial_sample,
}

put this right under TransformTesnor Class before init,

and define two methods:

def self.append_sample(self, values, item):
    values.append(item.array)
  
def self.append_sample(self, values, item):
    values.append(np.asarray(item))
    
def append_item(self, values, item):
    values.append(item)

And finally this code block will look as follows:

for item in items:
    append_method = self.INSTANCE_TYPE_TO_APPEND_METHOD[type(item)]
    append_method(values, item)

    if squeeze:
         values = values[0]
     return values    

This way we will make our code more isolated and easy to understand

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

slice_list=new_slice_list,
)
idx = self.idx
if self.numpy_only:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here triple nested if statements are complicated, can we separate it?

We can create several methods. The goal is to make it more readable, and more explicit on what happens if some if statements are not satisfied.

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

incoming_shape, self._ndim
)
return self.items.append(items)
if self.numpy_only:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can create a move the logic inside of the if statement to a new method. Also try to make if else statements more readable. Quite hard to grasp what is going on here.

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.



def write_sample_to_transform_dataset(out, transform_dataset):
if not is_empty_transform_dataset(out):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complicated logic for a single function. Either create several methods or a class where you will isolate all if else statements and for loops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is only 10 lines of code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 2 for loops, we need to isolate it... and if else statements

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function serves a single purpose which is to write a sample to a transform dataset. Unnecessarily separating a 10-line function for the sake of doing a for loop in another function only makes it harder to follow.

)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we create a separate method for try catch block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there are only 3 lines of code in the try block

Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

try:
transform_dataset.set_start_input_idx(i)

try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this try will catch error while a sample is being transformed. it is easier to handle compared to an error after a sample has been sent to the chunk engine.

try:
out = transform_sample(sample, pipeline, tensors)
except SampleAppendError as e:
if ignore_errors:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will only ignore SampleAppendError if ignore_errors is True

if pd and isinstance(data_slice, pd.DataFrame)
else data_slice
):
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try will catch errors from chunk engine

pg_callback,
ignore_errors,
):
n = len(data_slice)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't agree with it, you are doing bunch of checks and try catches, it is too much for a single function

pg_callback,
ignore_errors,
):
n = len(data_slice)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also pls add comments to try catch blocks, so that we will understand what it tries to catch

)
try:
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.



def write_sample_to_transform_dataset(out, transform_dataset):
if not is_empty_transform_dataset(out):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 2 for loops, we need to isolate it... and if else statements

incoming_shape, self._ndim
)
return self.items.append(items)
if self.numpy_only:
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.



def write_sample_to_transform_dataset(out, transform_dataset):
if not is_empty_transform_dataset(out):
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

pg_callback,
ignore_errors,
):
n = len(data_slice)
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

slice_list=new_slice_list,
)
idx = self.idx
if self.numpy_only:
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

elif isinstance(value, Sample):
value = value.array
return value
values: List[Any] = []
Copy link
Contributor

@adolkhan adolkhan Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.

deeplake/core/transform/transform_dataset.py Show resolved Hide resolved
@adolkhan adolkhan self-requested a review April 18, 2023 07:45
@FayazRahman FayazRahman merged commit eebde70 into main Apr 20, 2023
@FayazRahman FayazRahman deleted the fy_new_transform branch April 20, 2023 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants