-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor KNN graph handling and outlier detection in issue managers #1155
Merged
elisno
merged 14 commits into
cleanlab:master
from
elisno:refactor-knn-graph-recomputation
Jun 24, 2024
Merged
Refactor KNN graph handling and outlier detection in issue managers #1155
elisno
merged 14 commits into
cleanlab:master
from
elisno:refactor-knn-graph-recomputation
Jun 24, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This pattern repeats throughout the issue_managers package
Refactored the internal method to handle cases where neither 'features' nor 'pred_probs' are available without raising an error. The responsibility for input validation is now delegated to the `set_knn_graph` helper function.
- Use set_knn_graph for simplified setup from helper module. - Validation Check: Added a check to ensure the provided KNN graph (not the one stored in statistics or computed) has at least as many neighbors as required (self.k). Raises a ValueError with a clear message if the condition is not met. - Test Updates: Modified tests to align with the new KNN graph handling logic and added test cases to validate the new error handling. Turns out that previously, this issue manager had the side-effect of decreasing self.k, which none of the others ever did. It's far more reasonable to raise an error in this case for consistent behavior. This should only happen when the knn graph is explicitly pass in by the user during this check.
can you ask @sanjanag for review? Thanks! |
This commit introduces several important changes to the OutlierIssueManager: 1. Simplified API: Replace NearestNeighbors object with direct k, t, and metric parameters in the constructor. This reduces dependency on OutOfDistribution and aligns with Datalab's approach. 2. Enhanced find_issues method: Restructured to better handle different input scenarios (kNN graph, features, pred_probs). Added comments to explain the logic and decision-making process. 3. Improved kNN graph handling: Introduced _knn_graph_works and set_knn_graph helper functions for more robust graph management. 4. Updated test cases: Modified to reflect new API and behavior, particularly in kNN graph verification. 5. Removed metric storage for user-provided kNN graphs: Users are now responsible for tracking their own metrics when providing a pre-computed graph. 6. Future-proofing: Prepared for potential future storage of NearestNeighbors objects in info["statistics"] for global reuse across issue managers. These changes aim to make the outlier detection process more flexible, easier to understand, and consistent with Datalab's design principles. They also lay the groundwork for better documentation and user guidance in future updates. Note: The removal of the NearestNeighbors check in tests allows for more flexible handling of kNN objects, either user-constructed or potentially stored in global statistics in the future.
elisno
changed the title
Refactor logic for deciding on knn graph recomputation in issue managers
Refactor KNN graph handling and outlier detection in issue managers
Jun 21, 2024
The test_default_issue_types in the TestDataMonitorReuseStatisticInfo class is expected to fail due to a simplified API of the outlier issue manager in Datalab. DataMonitor is in experimental, where no public API has been finalized.
sanjanag
reviewed
Jun 21, 2024
sanjanag
reviewed
Jun 21, 2024
sanjanag
reviewed
Jun 21, 2024
sanjanag
reviewed
Jun 21, 2024
sanjanag
reviewed
Jun 21, 2024
sanjanag
reviewed
Jun 21, 2024
sanjanag
approved these changes
Jun 21, 2024
LGTM, merge whenever the comments are in. |
…if_knn_graph_storage_should_be_skipped.
- Check for graph in both kwargs and statistics - Prevents silent failures when using pre-computed or exisitng knn graphs
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
🎯 Purpose: This PR refactors the handling of KNN graphs in various issue managers, including a significant update to the outlier issue manager. The changes aim to improve flexibility, consistency, and error handling across Datalab's issue detection mechanisms.
Key Changes
Introduced helper functions for KNN graph processing in knn_graph_helpers.py.
Replaced existing methods with these helpers in multiple issue managers.
Added checks to validate the number of neighbors in stored KNN graphs,
user-provided ones should not require such validation.
Significantly refactored the OutlierIssueManager:
Simplified API by replacing NearestNeighbors object with direct parameters.
Restructured find_issues method for better handling of different input scenarios.
Improved KNN graph management and outlier score computation.
Updated related test cases to reflect new (intended) behavior.
Impact
🌐 Areas Affected:
👥 Who’s Affected:
Users who expect Datalab to automatically lower the number of neighbors
k
will now receive an error.The OutlierIssueManager now has a more consistent API with other issue managers.
Users providing pre-computed KNN graphs are now responsible for tracking their own metrics and values of k.
Testing
🔍 Testing Done:
tests/datalab/
pass locally.Future Considerations
Notes