Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor KNN graph handling and outlier detection in issue managers #1155

Merged
merged 14 commits into from
Jun 24, 2024

Conversation

elisno
Copy link
Member

@elisno elisno commented Jun 20, 2024

Summary

🎯 Purpose: This PR refactors the handling of KNN graphs in various issue managers, including a significant update to the outlier issue manager. The changes aim to improve flexibility, consistency, and error handling across Datalab's issue detection mechanisms.

Key Changes

  1. Introduced helper functions for KNN graph processing in knn_graph_helpers.py.

  2. Replaced existing methods with these helpers in multiple issue managers.

  3. Added checks to validate the number of neighbors in stored KNN graphs,
    user-provided ones should not require such validation.

  4. Significantly refactored the OutlierIssueManager:

    • Simplified API by replacing NearestNeighbors object with direct parameters.

    • Restructured find_issues method for better handling of different input scenarios.

    • Improved KNN graph management and outlier score computation.

    • Updated related test cases to reflect new (intended) behavior.

Impact

🌐 Areas Affected:

  • cleanlab/datalab/internal/issue_manager/
    ├── data_valuation.py
    ├── duplicate.py
    ├── noniid.py
    ├── outlier.py
    └── underperforming_group.py
    
  • Added knn_graph_helpers.py for shared KNN graph functionality.

👥 Who’s Affected:

  • Users who expect Datalab to automatically lower the number of neighbors k will now receive an error.

  • The OutlierIssueManager now has a more consistent API with other issue managers.

  • Users providing pre-computed KNN graphs are now responsible for tracking their own metrics and values of k.

Testing

🔍 Testing Done:

  • Updated existing tests to reflect changes in KNN graph handling and outlier detection.
  • Added new tests to ensure proper validation and error handling for insufficient KNN graphs.
  • Verified that all tests in tests/datalab/ pass locally.

Future Considerations

  • The changes to the OutlierIssueManager lay the groundwork for potential future optimizations, such as storing NearestNeighbors objects in info["statistics"] for global reuse across issue managers.
  • Any edge cases discovered in further use will be addressed in future updates.

Notes

  • These changes represent a significant step towards more consistent and flexible issue detection in Datalab, particularly for outlier detection.
  • The PR aims to balance immediate improvements with the potential for future enhancements, prioritizing incremental improvements to the library.

elisno added 7 commits June 19, 2024 23:41
This pattern repeats throughout the issue_managers package
Refactored the internal method to handle cases where neither 'features' nor 'pred_probs' are available without raising an error. The responsibility for input validation is now delegated to the `set_knn_graph` helper function.
- Use set_knn_graph for simplified setup from helper module.
- Validation Check: Added a check to ensure the provided KNN graph  (not the one stored in statistics or computed) has at least as many neighbors as required (self.k). Raises a ValueError with a clear message if the condition is not met.
- Test Updates: Modified tests to align with the new KNN graph handling logic and added test cases to validate the new error handling. Turns out that previously, this issue manager had the side-effect of decreasing self.k, which none of the others ever did. It's far more reasonable to raise an error in this case for consistent behavior. This should only happen when the knn graph is explicitly pass in by the user during this check.
@jwmueller
Copy link
Member

can you ask @sanjanag for review? Thanks!

@elisno elisno assigned sanjanag and unassigned sanjanag Jun 20, 2024
@elisno elisno requested a review from sanjanag June 20, 2024 13:27
This commit introduces several important changes to the OutlierIssueManager:

1. Simplified API: Replace NearestNeighbors object with direct k, t, and metric parameters in the constructor. This reduces dependency on OutOfDistribution and aligns with Datalab's approach.

2. Enhanced find_issues method: Restructured to better handle different input scenarios (kNN graph, features, pred_probs). Added comments to explain the logic and decision-making process.

3. Improved kNN graph handling: Introduced _knn_graph_works and set_knn_graph helper functions for more robust graph management.

4. Updated test cases: Modified to reflect new API and behavior, particularly in kNN graph verification.

5. Removed metric storage for user-provided kNN graphs: Users are now responsible for tracking their own metrics when providing a pre-computed graph.

6. Future-proofing: Prepared for potential future storage of NearestNeighbors objects in info["statistics"] for global reuse across issue managers.

These changes aim to make the outlier detection process more flexible, easier to understand, and consistent with Datalab's design principles. They also lay the groundwork for better documentation and user guidance in future updates.

Note: The removal of the NearestNeighbors check in tests allows for more flexible handling of kNN objects, either user-constructed or potentially stored in global statistics in the future.
@elisno elisno changed the title Refactor logic for deciding on knn graph recomputation in issue managers Refactor KNN graph handling and outlier detection in issue managers Jun 21, 2024
The test_default_issue_types in the TestDataMonitorReuseStatisticInfo class is expected to fail due to a simplified API of the outlier issue manager in Datalab.
DataMonitor is in experimental, where no public API has been finalized.
@sanjanag
Copy link
Member

LGTM, merge whenever the comments are in.

@elisno elisno merged commit 2a68fd7 into cleanlab:master Jun 24, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants