How to get issues from all samples in the dataset for semantic segmentation tasks without memory issues #842
Could you share the code you are trying to run? One thing you can try (I'm not sure if this works) is to save your predicted probabilities to disk (e.g. as YOURFILE.zarr) and pass them as a memory-mapped object for `pred_probs`.
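A rough sketch of that idea, using the YOURFILE.zarr placeholder (the shapes, dtype, and the `model_predictions()` generator below are hypothetical, and it is untested whether cleanlab accepts a zarr array directly):

```python
import zarr

# Write the predicted probabilities to a chunked, compressed zarr store
# (one chunk per image), so they never all have to sit in RAM at once.
num_images, num_classes, height, width = 180_000, 20, 512, 512  # placeholder dimensions
store = zarr.open(
    "YOURFILE.zarr", mode="w",
    shape=(num_images, num_classes, height, width),
    chunks=(1, num_classes, height, width),
    dtype="float16",
)
for i, probs in enumerate(model_predictions()):  # hypothetical generator of per-image pred_probs
    store[i] = probs

# Re-open lazily later; chunks are decompressed only as they are accessed.
pred_probs = zarr.open("YOURFILE.zarr", mode="r")
```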
Another idea: if you're primarily interested in computing a label quality score for each image, i.e. via segmentation.rank.get_label_quality_scores, then you can run this method on small batches of images at a time to get their label quality scores independently of the rest of the images, e.g. as sketched below.
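A rough sketch of that batched loop (the batch size and array shapes are assumptions, and I'm assuming get_label_quality_scores returns per-image and per-pixel scores as in recent cleanlab versions):

```python
import numpy as np
from cleanlab.segmentation.rank import get_label_quality_scores

batch = 100  # images scored at a time -- tune to your available RAM
all_image_scores = []
for start in range(0, len(labels), batch):
    # labels: (N, H, W) integer masks; pred_probs: (N, K, H, W) probabilities.
    # Each call only sees one small slice, so peak memory stays bounded.
    image_scores, _pixel_scores = get_label_quality_scores(
        labels[start : start + batch],
        pred_probs[start : start + batch],
    )
    all_image_scores.append(image_scores)
all_image_scores = np.concatenate(all_image_scores)
```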
Thanks for the reply. I am not sure if using YOURFILE.zarr would work, because I am only experimenting now with 4.5k out-of-sample predictions, but I ultimately need to apply this to 180k. As for segmentation.rank.get_label_quality_scores, I actually did not know it works differently from finding the label issues directly. Here is how I read the labels and pred_probs and get the issues; I already use memmap for it.
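Roughly like this (the file paths and dtypes below are placeholders for my actual setup):

```python
import numpy as np
from cleanlab.segmentation.filter import find_label_issues

# Memory-map the saved arrays instead of loading them fully into RAM.
labels = np.load("labels.npy", mmap_mode="r")          # (N, H, W) integer masks
pred_probs = np.load("pred_probs.npy", mmap_mode="r")  # (N, K, H, W) float16 probabilities

# Per-pixel boolean mask of suspected label issues across the dataset.
issues = find_label_issues(labels, pred_probs)
```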
And here is the error I get with this code:
Additionally, I tried another approach, but the error I get is again memory-related, and it is strange because it changes the shape of the pred_probs:
I can actually get the label quality scores and also the issues without a problem now. The memory issue occurs when I try to get the label issues directly using `segmentation.filter.find_label_issues`.
What I do additionally here is specify a threshold value of 0.5. Do you think this is similar to what `find_label_issues` does?
I have used .zarr compression, and it compressed the numpy prediction arrays from 100 MB to 5 MB each. I confirm that the compression is lossless. But I still could not feed it into cleanlab, since it would still need 900 GB of storage space, which is huge.
Thanks for providing the additional information. Your additional workaround sounds good to me, and I'd proceed with that for your data. @vdlad is looking further into this issue. We suspect the bottleneck is this line of code: `cleanlab/segmentation/filter.py`, line 124 (at commit b2c28d2).
The chunking approach I used was as follows:
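Something like this (the chunk size, file paths, and on-the-fly saving are placeholders for my actual code):

```python
import numpy as np
from cleanlab.segmentation.filter import find_label_issues

chunk = 500  # images per chunk -- small enough for the issue search to fit in RAM
for start in range(0, len(labels), chunk):
    stop = min(start + chunk, len(labels))
    # Run the issue search on this chunk only and write the result to disk,
    # so nothing accumulates in memory across chunks.
    chunk_issues = find_label_issues(labels[start:stop], pred_probs[start:stop])
    np.save(f"issues_{start}_{stop}.npy", chunk_issues)
```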
As seen, I used chunks when getting the issues. I think I will be able to get the results, but I am not sure whether this chunked approach is equivalent to running the computation on the full dataset at once.
Hi there @vdlad, have you been able to implement such an approach? I have tried the chunking method to get the results and save them on the fly. Even though I got results, they are worse at finding label mistakes than the method I used above (thresholding the label quality scores).
In summary, I concatenate the chunks from different folds, which represent predictions from different models, and then feed them into cleanlab.
@hamzagorgulu For your data, I recommend simply doing:
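A minimal sketch of what that might look like (the threshold value and variable names here are illustrative, not the exact recommendation):

```python
from cleanlab.segmentation.rank import get_label_quality_scores

# Score each image's overall label quality; this can be run fold by fold
# or in small batches, since images are scored independently.
image_scores, pixel_scores = get_label_quality_scores(labels, pred_probs)

# Flag the lowest-scoring images via a manually chosen cutoff.
threshold = 0.5  # placeholder -- pick it by inspecting the score distribution
flagged = image_scores < threshold
```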
where you'll want to visually estimate what a good threshold is. This will not give the same results as `segmentation.filter.find_label_issues`, which is intended to additionally help estimate the number of mislabeled images as well.
Thanks for the answer. I'm asking just to be sure: I don't have to provide the out-of-sample predictions for the first approach, right? I can iterate over the folds one by one.
Yep, because the label quality scores for each image are computed independently of the other images.
Tracking a fix for the original issue here: #863
I am trying to get issues from a 180k-image dataset with 20 classes, but the prediction numpy array is about 100 MB per image. So I am not able to store the predictions, because I would need 18 TB of space for this. I used fp16 to reduce the size of the arrays, but they are still too huge to handle.
Do you have any suggestions that would reduce the size of the predictions while still being acceptable to cleanlab for semantic segmentation?
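For scale, the storage estimate quoted above works out as a quick back-of-the-envelope calculation:

```python
per_image_mb = 100        # ~100 MB of predicted probabilities per image
n_images = 180_000
total_tb = per_image_mb * n_images / 1_000_000
print(total_tb)           # -> 18.0, i.e. roughly 18 TB for the full dataset
```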