Add a knn module #1117

Merged
merged 60 commits into from
May 14, 2024
b9b9274
Add test for all identical examples in test_regression.py
elisno May 3, 2024
182f50c
Add tests for detecting label issues in all identical examples dataset
elisno May 3, 2024
a0c197c
Add tests for detecting label issues in all identical examples dataset
elisno May 3, 2024
fb93c07
clarify test_all_identical_examples in test_regression.py
elisno May 3, 2024
08e86a6
remove unused Datalab test fixtures in test_all_identical_examples.py
elisno May 3, 2024
ec8743e
Improve readability and documentation of test class for all identical…
elisno May 3, 2024
cbd1eab
Rename variable for number of feature columns (K -> M)
elisno May 3, 2024
e841249
add more issue types for classification (underperforming groups and c…
elisno May 4, 2024
49b17ee
fix typos
elisno May 4, 2024
25c4c83
update documentation of TestAllIdenticalExamplesDataset
elisno May 4, 2024
747331a
Add decide_metric function to determine distance metric for neighbor …
elisno May 2, 2024
e6f2251
Add NeighborSearch protocol for k-nearest neighbors search
elisno May 2, 2024
f90641e
Add types for FeatureArray and Metric in neighbor/types.py
elisno May 2, 2024
eae62e8
rename test file
elisno May 2, 2024
a40f31d
Add features_to_knn function to build and fit a k-nearest neighbors s…
elisno May 2, 2024
a588774
export features_to_knn from neighbor submodule
elisno May 2, 2024
0604ab3
Add docs for cleanlab.internal.neighbor modules
elisno May 2, 2024
a0d6579
correct neighbor.py
elisno May 2, 2024
638ae31
let search.py only work with NearestNeighbors
elisno May 2, 2024
a13c1f8
refactor outlier.py to use knn construction function
elisno May 2, 2024
76d16d2
Refactor duplicate.py to use features_to_knn function for constructin…
elisno May 3, 2024
5b33870
Refactor duplicate.py to use the knn_to_knn_graph function defined in…
elisno May 3, 2024
45c4c3f
remove unused imports in outlier.py
elisno May 3, 2024
32d80ff
test knn_to_knn_graph
elisno May 3, 2024
a44d367
ignore unused import in __init__.py
elisno May 3, 2024
ceb3744
Refactor regression.rank.py to use features_to_knn function for const…
elisno May 3, 2024
1082702
Fix default value for neighbor_metric in rank.py
elisno May 3, 2024
8d662e2
improve clarity of code selecting number of neighbors, add comments
elisno May 6, 2024
4395874
Refactor metric.py to improve clarity of code selecting number of nei…
elisno May 6, 2024
e2dd0fc
rename knn_to_knn_graph to construct_knn_graph_from_index
elisno May 6, 2024
47a7715
add edge case for test_features_to_knn where last 11 points are ident…
elisno May 6, 2024
85b7704
Test more attributes of NearestNeighbors object
elisno May 6, 2024
c69205c
comment on DEFAULT_K
elisno May 6, 2024
c826901
Appy suggestion
elisno May 6, 2024
04ef982
Move type alias definitions to cleanlab/typing.py
elisno May 6, 2024
019629b
add comment about distance_metric being logged as string
elisno May 6, 2024
b21db8f
fix imports of types
elisno May 6, 2024
3eb4d1b
delete types.py, all types moved to typing.py
elisno May 6, 2024
f45370d
enumerate properties and attributes that a NearestNeighbors-like obje…
elisno May 6, 2024
dd331c6
Split decide_metric into two functions
elisno May 6, 2024
304ef71
ensure euclidean metric is used for outre method of computing label q…
elisno May 6, 2024
e392b73
rename module neighbor.py to knn_graph.py
elisno May 6, 2024
981776f
fix import of renamed function for deciding on default metric
elisno May 6, 2024
694625b
Replace .kneighbors_graph calls with construct_knn_graph_from_index c…
elisno May 6, 2024
b4f7a1e
Add construct_knn_graph_from_features to let issue managers create kn…
elisno May 6, 2024
fb52a6b
add more test cases for all-identical examples.
elisno May 6, 2024
31f9350
Refactor NearDuplicateIssueManager to use construct_knn_graph_from_fe…
elisno May 6, 2024
acefd1d
update docs
elisno May 6, 2024
cf798b0
update comment
elisno May 6, 2024
b420a5a
Merge branch 'master' into knn-module
elisno May 6, 2024
579bdb0
remove unused imports
elisno May 6, 2024
3314ea3
Update cleanlab/internal/neighbor/knn_graph.py
elisno May 7, 2024
402d3fe
Update cleanlab/internal/neighbor/knn_graph.py
elisno May 7, 2024
d100541
Update construct_knn_graph_from_features to return both knn_graph and…
elisno May 7, 2024
c6ec459
improve code readability in _configure_num_neighbors
elisno May 7, 2024
318d9a5
Rename function for accuracy
elisno May 7, 2024
c8a7061
remove unused import
elisno May 7, 2024
22965b9
Apply suggestions from code review
elisno May 13, 2024
d4d2d3a
Update cleanlab/internal/neighbor/metric.py
elisno May 14, 2024
bda30f2
apply black formatter
elisno May 14, 2024
34 changes: 5 additions & 29 deletions cleanlab/data_valuation.py
@@ -19,14 +19,12 @@
"""


from typing import Callable, Optional, Union, cast
from typing import Callable, Optional, Union

import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.neighbors import NearestNeighbors
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

from cleanlab.internal.neighbor.knn_graph import create_knn_graph_and_index


def _knn_shapley_score(knn_graph: csr_matrix, labels: np.ndarray, k: int) -> np.ndarray:
@@ -45,29 +43,6 @@ def _knn_shapley_score(knn_graph: csr_matrix, labels: np.ndarray, k: int) -> np.
return 0.5 * (np.mean(scores / k, axis=0) + 1)


def _process_knn_graph_from_features(
    features: np.ndarray, metric: Optional[Union[str, Callable]], k: int = 10
) -> csr_matrix:
    """Calculate the knn graph from the features if it is not provided in the kwargs."""
    if k > len(features):  # Ensure number of neighbors less than number of examples
        raise ValueError(
            f"Number of nearest neighbors k={k} cannot exceed the number of examples N={len(features)} passed into the estimator (knn)."
        )
    if metric == None:
        metric = (
            "cosine"
            if features.shape[1] > 3
            else "euclidean" if features.shape[0] > 100 else euclidean
        )
    knn = NearestNeighbors(n_neighbors=k, metric=metric).fit(features)
    knn_graph = knn.kneighbors_graph(mode="distance")
    try:
        check_is_fitted(knn)
    except NotFittedError:
        knn.fit(features)
    return knn_graph


def data_shapley_knn(
labels: np.ndarray,
*,
@@ -135,6 +110,7 @@ def data_shapley_knn(
if knn_graph is None and features is None:
raise ValueError("Either knn_graph or features must be provided.")

# Use provided knn_graph or compute it from features
if knn_graph is None:
knn_graph = _process_knn_graph_from_features(cast(np.ndarray, features), metric, k)
knn_graph, _ = create_knn_graph_and_index(features, n_neighbors=k, metric=metric)
return _knn_shapley_score(knn_graph, labels, k)
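
The removed `_process_knn_graph_from_features` helper picked a default metric from the shape of the feature array, and the same conditional is deleted from several issue managers in this PR. A minimal sketch of that heuristic in isolation, assuming only what the removed lines show; `pick_default_metric_name` is a hypothetical name, and it returns a string label for the last branch where the removed code returned the `euclidean` callable from `scipy.spatial.distance`. It is not the implementation in `cleanlab/internal/neighbor/metric.py`, which this diff does not show.

```python
from typing import Tuple


def pick_default_metric_name(shape: Tuple[int, int]) -> str:
    """Mirror the default-metric conditional from the removed helper.

    `shape` is (number of examples, number of feature columns).
    Hypothetical helper for illustration only.
    """
    n_examples, n_features = shape
    if n_features > 3:
        # More than 3 feature columns: the removed code chose "cosine".
        return "cosine"
    if n_examples > 100:
        # Low-dimensional data with more than 100 examples: "euclidean" string.
        return "euclidean"
    # Low-dimensional, small dataset: the removed code passed the scipy
    # callable `euclidean`; a label stands in for it here.
    return "scipy.spatial.distance.euclidean"


high_dim = pick_default_metric_name((50, 10))
low_dim_large = pick_default_metric_name((500, 2))
low_dim_small = pick_default_metric_name((50, 2))
```

Keeping the rule in one function is what lets the call sites in this PR shrink to a single call that passes `metric=self.metric` through.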
34 changes: 4 additions & 30 deletions cleanlab/datalab/internal/issue_manager/data_valuation.py
@@ -25,19 +25,15 @@
Optional,
Union,
)
import warnings


import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.validation import check_is_fitted

from cleanlab.data_valuation import data_shapley_knn
from cleanlab.datalab.internal.issue_manager import IssueManager
from cleanlab.internal.neighbor.knn_graph import create_knn_graph_and_index

if TYPE_CHECKING: # pragma: no cover
import numpy.typing as npt
@@ -139,33 +135,11 @@ def find_issues(
)
raise TypeError(error_msg)
if knn_graph is None or metric_changes:
if features is None:
raise ValueError(
"If a knn_graph is not provided, features must be provided to fit a new knn."
)
if self.metric is None:
self.metric = (
"cosine"
if features.shape[1] > 3
else "euclidean" if features.shape[0] > 100 else euclidean
)
knn = NearestNeighbors(n_neighbors=self.k, metric=self.metric).fit(features)

if self.metric and self.metric != knn.metric:
Review comment (Member, Author):

This condition can never evaluate to True: the knn object is constructed with self.metric just above, so knn.metric always equals self.metric at this point, even when the metric has changed.

Removing it is safe.

warnings.warn(
f"Metric {self.metric} does not match metric {knn.metric} used to fit knn. "
"Most likely an existing NearestNeighbors object was passed in, but a different "
"metric was specified."
)
knn_graph, knn = create_knn_graph_and_index(
features, n_neighbors=self.k, metric=self.metric
)
self.metric = knn.metric

try:
check_is_fitted(knn)
except NotFittedError:
knn.fit(features)

knn_graph = knn.kneighbors_graph(mode="distance")

scores = data_shapley_knn(labels, knn_graph=knn_graph, k=self.k)

self.issues = pd.DataFrame(
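
The review comment in this hunk argues that the removed `self.metric != knn.metric` check was dead code. One way to convince yourself, assuming only that scikit-learn estimators store their constructor arguments unchanged as attributes: build a `NearestNeighbors` object with a given metric and compare the two values, as the removed condition did. The loop below is an illustrative sketch, not code from the PR.

```python
from sklearn.neighbors import NearestNeighbors

# Metrics to try; any metric accepted by NearestNeighbors would do here.
requested_metrics = ["cosine", "euclidean", "manhattan"]

mismatches = []
for requested in requested_metrics:
    # Same construction pattern as the removed code: the metric is passed in directly.
    knn = NearestNeighbors(n_neighbors=2, metric=requested)
    # Same condition as the removed warning guard.
    if requested and requested != knn.metric:
        mismatches.append(requested)
```

`mismatches` stays empty, which is consistent with the comment's claim that the warning branch could not be reached.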
34 changes: 5 additions & 29 deletions cleanlab/datalab/internal/issue_manager/duplicate.py
@@ -21,12 +21,10 @@
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.neighbors import NearestNeighbors
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


from cleanlab.datalab.internal.issue_manager import IssueManager
from cleanlab.internal.neighbor.knn_graph import create_knn_graph_and_index
from cleanlab.internal.constants import EPSILON

if TYPE_CHECKING: # pragma: no cover
@@ -76,32 +74,10 @@ def find_issues(
metric_changes = self.metric and self.metric != old_knn_metric

if knn_graph is None or metric_changes:
if features is None:
raise ValueError(
"If a knn_graph is not provided, features must be provided to fit a new knn."
)
if self.metric is None:
self.metric = (
"cosine"
if features.shape[1] > 3
else "euclidean" if features.shape[0] > 100 else euclidean
)
knn = NearestNeighbors(n_neighbors=self.k, metric=self.metric)

if self.metric and self.metric != knn.metric:
warnings.warn(
f"Metric {self.metric} does not match metric {knn.metric} used to fit knn. "
"Most likely an existing NearestNeighbors object was passed in, but a different "
"metric was specified."
)
knn_graph, knn = create_knn_graph_and_index(
features, n_neighbors=self.k, metric=self.metric
)
self.metric = knn.metric

try:
check_is_fitted(knn)
except NotFittedError:
knn.fit(features)

knn_graph = knn.kneighbors_graph(mode="distance")
N = knn_graph.shape[0]
nn_distances = knn_graph.data.reshape(N, -1)[:, 0]
median_nn_distance = max(np.median(nn_distances), EPSILON) # avoid threshold = 0
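
The last three lines of this hunk read each example's nearest-neighbor distance straight out of the sparse graph's `data` array. A small sketch of why that works, assuming the graph comes from scikit-learn's `kneighbors_graph(mode="distance")`, where every row stores exactly `k` entries ordered by increasing distance. The `EPSILON` value here is a hypothetical stand-in for `cleanlab.internal.constants.EPSILON`, whose actual value is not shown in this diff.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EPSILON = 1e-12  # hypothetical stand-in, not cleanlab's constant

# Two tight pairs of points: (0, 1) are 1.0 apart, (2, 3) are 0.5 apart.
features = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 0.5]])

knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(features)
knn_graph = knn.kneighbors_graph(mode="distance")  # excludes each point itself

N = knn_graph.shape[0]
# Each row contributes k consecutive values to .data, so reshaping to (N, k)
# puts the closest neighbor's distance in column 0.
nn_distances = knn_graph.data.reshape(N, -1)[:, 0]
median_nn_distance = max(np.median(nn_distances), EPSILON)  # avoid threshold = 0
```

The reshape only holds when every row has the same number of stored entries, so a graph built some other way (for example a radius graph) would need a different extraction.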
31 changes: 4 additions & 27 deletions cleanlab/datalab/internal/issue_manager/noniid.py
@@ -1,19 +1,16 @@
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Callable, ClassVar, Dict, Optional, Union, cast
import warnings
import itertools

from scipy.stats import gaussian_kde
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.neighbors import NearestNeighbors
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

from cleanlab.datalab.internal.issue_manager import IssueManager
from cleanlab.internal.neighbor.knn_graph import construct_knn_graph_from_index, features_to_knn

if TYPE_CHECKING: # pragma: no cover
import numpy.typing as npt
@@ -203,28 +200,8 @@ def _setup_knn(
return None
features_to_use = self._determine_features(features, pred_probs)

if self.metric is None:
self.metric = (
"cosine"
if features_to_use.shape[1] > 3
else "euclidean" if features_to_use.shape[0] > 100 else euclidean
)

knn = NearestNeighbors(n_neighbors=self.k, metric=self.metric)

if self.metric != knn.metric:
warnings.warn(
f"Metric {self.metric} does not match metric {knn.metric} used to fit knn. "
"Most likely an existing NearestNeighbors object was passed in, but a different "
"metric was specified."
)
self.metric = knn.metric

try:
check_is_fitted(knn)
except NotFittedError:
knn.fit(features_to_use)

knn = features_to_knn(features_to_use, n_neighbors=self.k, metric=self.metric)
self.metric = knn.metric # Update the metric to the one used in the KNN object.
return knn

def find_issues(
@@ -305,7 +282,7 @@ def collect_info(
}
if knn_graph is None:
assert knn is not None, "If knn_graph is None, knn must be provided."
knn_graph = knn.kneighbors_graph(mode="distance") # type: ignore[union-attr]
knn_graph = construct_knn_graph_from_index(knn)

assert knn_graph is not None, "knn_graph must be provided or computed."
statistics_dict = self._build_statistics_dictionary(knn_graph=knn_graph)
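
In `noniid.py` the work is split into two steps: fit a search index from features, then derive the distance graph from that index. A rough sketch of the pattern using only scikit-learn is given below. `fit_index` and `graph_from_index` are hypothetical stand-ins written for illustration; they are not cleanlab's `features_to_knn` or `construct_knn_graph_from_index`, whose bodies are not part of this diff and may behave differently.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def fit_index(features: np.ndarray, n_neighbors: int, metric: str = "euclidean") -> NearestNeighbors:
    """Hypothetical helper: fit a NearestNeighbors index on the features."""
    # Same guard as the helper removed from cleanlab/data_valuation.py.
    if n_neighbors > len(features):
        raise ValueError(
            f"Number of nearest neighbors k={n_neighbors} cannot exceed "
            f"the number of examples N={len(features)}."
        )
    return NearestNeighbors(n_neighbors=n_neighbors, metric=metric).fit(features)


def graph_from_index(knn: NearestNeighbors) -> csr_matrix:
    """Hypothetical helper: distance-weighted knn graph of the fitted data."""
    # This is the call the diff replaces at its former call sites.
    return knn.kneighbors_graph(mode="distance")


rng = np.random.default_rng(0)
features = rng.normal(size=(20, 2))

knn = fit_index(features, n_neighbors=3)
knn_graph = graph_from_index(knn)
metric_used = knn.metric  # callers in the diff copy this back to self.metric
```

Returning the fitted index alongside the graph, as `create_knn_graph_and_index` does in the other issue managers, lets the caller record `knn.metric` without refitting.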
3 changes: 2 additions & 1 deletion cleanlab/datalab/internal/issue_manager/outlier.py
@@ -23,6 +23,7 @@
import pandas as pd

from cleanlab.datalab.internal.issue_manager import IssueManager
from cleanlab.internal.neighbor.knn_graph import construct_knn_graph_from_index
from cleanlab.outlier import OutOfDistribution, transform_distances_to_scores

if TYPE_CHECKING: # pragma: no cover
@@ -219,7 +220,7 @@ def _process_knn_graph_from_features(self, kwargs: Dict) -> csr_matrix:
# If the pre-existing knn graph has fewer neighbors than the knn object,
# then we need to recompute the knn graph
assert knn == self.ood.params["knn"] # type: ignore[union-attr]
knn_graph = knn.kneighbors_graph(mode="distance") # type: ignore[union-attr]
knn_graph = construct_knn_graph_from_index(knn)
self._metric = knn.metric # type: ignore[union-attr]

return knn_graph
32 changes: 4 additions & 28 deletions cleanlab/datalab/internal/issue_manager/underperforming_group.py
@@ -22,13 +22,10 @@
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.neighbors import NearestNeighbors
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted
from sklearn.cluster import DBSCAN

from cleanlab.datalab.internal.issue_manager import IssueManager
from cleanlab.internal.neighbor.knn_graph import create_knn_graph_and_index
from cleanlab.rank import get_self_confidence_for_each_label

if TYPE_CHECKING: # pragma: no cover
@@ -153,31 +150,10 @@ def set_knn_graph(
metric_changes = self.metric and self.metric != old_knn_metric

if knn_graph is None or metric_changes:
if features is None:
raise ValueError(
"If a knn_graph is not provided, features must be provided to fit a new knn."
)
if self.metric is None:
self.metric = (
"cosine"
if features.shape[1] > 3
else "euclidean" if features.shape[0] > 100 else euclidean
)
knn = NearestNeighbors(n_neighbors=self.k, metric=self.metric)

if self.metric and self.metric != knn.metric:
warnings.warn(
f"Metric {self.metric} does not match metric {knn.metric} used to fit knn. "
"Most likely an existing NearestNeighbors object was passed in, but a different "
"metric was specified."
)
knn_graph, knn = create_knn_graph_and_index(
features, n_neighbors=self.k, metric=self.metric
)
self.metric = knn.metric

try:
check_is_fitted(knn)
except NotFittedError:
knn.fit(features)
knn_graph = knn.kneighbors_graph(mode="distance")
return knn_graph

def perform_clustering(self, knn_graph: csr_matrix) -> npt.NDArray[np.int_]:
1 change: 1 addition & 0 deletions cleanlab/internal/neighbor/__init__.py
@@ -0,0 +1 @@
from .knn_graph import features_to_knn