-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AL-2038] Faster transforms #2094
Conversation
…fy_new_transform
…fy_new_transform
…fy_new_transform
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #2094 +/- ##
==========================================
+ Coverage 88.15% 88.49% +0.34%
==========================================
Files 291 291
Lines 32632 32853 +221
==========================================
+ Hits 28767 29074 +307
+ Misses 3865 3779 -86
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 128 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
…fy_new_transform
…fy_new_transform
…fy_new_transform
pg_callback, | ||
ignore_errors, | ||
): | ||
n = len(data_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Either create several methods or a class. One solution could be to initialize an object replace each if statement with a method. To be honest I prefer more creating object for this method.
- Nested try catch block seems complicated, can you put some comments and elaborate more on why we need a nested try catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function only calls other functions and does almost nothing else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't agree with it, you are doing bunch of checks and try catches, it is too much for a single function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also pls add comments to try catch blocks, so that we will understand what it tries to catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are very few checks and try catch blocks are there for a reason. This functionn also serves only one person. please try to elaborate more what you mean.
elif isinstance(value, Sample): | ||
value = value.array | ||
return value | ||
values: List[Any] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit complicated logic, can we separate it?
For instance we can keep for loop, create a dict:
INSTANCE_TYPE_TO_APPEND_METHOD = {
Sample: self.append_sample,
LinkedSample: self.append_tensor_or_partial_sample,
Tensor: self.append_tensor_or_partial_sample,
type(None): self.append_tensor_or_partial_sample,
PartialSample: self.append_tensor_or_partial_sample,
LinkedTiledSample: self.append_tensor_or_partial_sample,
}
put this right under TransformTesnor Class before init,
and define two methods:
def self.append_sample(self, values, item):
values.append(item.array)
def self.append_sample(self, values, item):
values.append(np.asarray(item))
def append_item(self, values, item):
values.append(item)
And finally this code block will look as follows:
for item in items:
append_method = self.INSTANCE_TYPE_TO_APPEND_METHOD[type(item)]
append_method(values, item)
if squeeze:
values = values[0]
return values
This way we will make our code more isolated and easy to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
slice_list=new_slice_list, | ||
) | ||
idx = self.idx | ||
if self.numpy_only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here triple nested if statements are complicated, can we separate it?
We can create several methods. The goal is to make it more readable, and more explicit on what happens if some if statements are not satisfied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
incoming_shape, self._ndim | ||
) | ||
return self.items.append(items) | ||
if self.numpy_only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can create a move the logic inside of the if statement to a new method. Also try to make if else statements more readable. Quite hard to grasp what is going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
|
||
|
||
def write_sample_to_transform_dataset(out, transform_dataset): | ||
if not is_empty_transform_dataset(out): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complicated logic for a single function. Either create several methods or a class where you will isolate all if else statements and for loops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is only 10 lines of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 2 for loops, we need to isolate it... and if else statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function serves a single purpose which is to write a sample to a transform dataset. Unnecessarily separating a 10-line function for the sake of doing a for loop in another function only makes it harder to follow.
) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we create a separate method for try catch block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there are only 3 lines of code in the try block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
try: | ||
transform_dataset.set_start_input_idx(i) | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this try will catch error while a sample is being transformed. it is easier to handle compared to an error after a sample has been sent to the chunk engine.
try: | ||
out = transform_sample(sample, pipeline, tensors) | ||
except SampleAppendError as e: | ||
if ignore_errors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will only ignore SampleAppendError
if ignore_errors
is True
if pd and isinstance(data_slice, pd.DataFrame) | ||
else data_slice | ||
): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try will catch errors from chunk engine
pg_callback, | ||
ignore_errors, | ||
): | ||
n = len(data_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't agree with it, you are doing bunch of checks and try catches, it is too much for a single function
pg_callback, | ||
ignore_errors, | ||
): | ||
n = len(data_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also pls add comments to try catch blocks, so that we will understand what it tries to catch
) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
|
||
|
||
def write_sample_to_transform_dataset(out, transform_dataset): | ||
if not is_empty_transform_dataset(out): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 2 for loops, we need to isolate it... and if else statements
incoming_shape, self._ndim | ||
) | ||
return self.items.append(items) | ||
if self.numpy_only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
|
||
|
||
def write_sample_to_transform_dataset(out, transform_dataset): | ||
if not is_empty_transform_dataset(out): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
pg_callback, | ||
ignore_errors, | ||
): | ||
n = len(data_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
slice_list=new_slice_list, | ||
) | ||
idx = self.idx | ||
if self.numpy_only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
elif isinstance(value, Sample): | ||
value = value.array | ||
return value | ||
values: List[Any] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that, we need to isolate as much as possible, single method should either do try catch or do for loop with other function calls. It is called single responsibility principle. You can take a look at it, if use this principle our code base will be much cleaner, isolated and scalable.
🚀 🚀 Pull Request
Checklist:
coverage-rate
upChanges