datalab_regression #796

Merged
merged 121 commits into from
Nov 20, 2023
4f6df23
datalab_regression
mglowacki100 Aug 7, 2023
3d58d98
Update label.py
mglowacki100 Aug 7, 2023
1c9fa64
black format, issue_name for regression
mglowacki100 Aug 7, 2023
8a0a291
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
d120d57
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
29e9a83
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
d202498
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
46fafc5
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
c5af243
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
323b5d0
Update cleanlab/datalab/internal/report.py
mglowacki100 Aug 7, 2023
d167899
Update cleanlab/datalab/internal/issue_finder.py
mglowacki100 Aug 7, 2023
430db78
Update cleanlab/datalab/internal/issue_manager_factory.py
mglowacki100 Aug 7, 2023
0050294
Update cleanlab/datalab/internal/issue_manager_factory.py
mglowacki100 Aug 7, 2023
3b4bbf9
Update cleanlab/datalab/internal/issue_manager_factory.py
mglowacki100 Aug 7, 2023
7c8cb7e
Update cleanlab/datalab/internal/issue_manager_factory.py
mglowacki100 Aug 7, 2023
02e8aac
Update issue_manager_factory.py
mglowacki100 Aug 7, 2023
817ce2e
fixing required positional argument: 'task'
mglowacki100 Aug 7, 2023
922fb2f
reverse changes and change tests
mglowacki100 Aug 7, 2023
e832db8
change in architecture - test
mglowacki100 Aug 7, 2023
9afb9bb
Update datalab.py
mglowacki100 Aug 7, 2023
8e4044d
Update issue_finder.py
mglowacki100 Aug 7, 2023
d0f102b
Update report.py
mglowacki100 Aug 7, 2023
fa1834a
fixing imagelab
mglowacki100 Aug 7, 2023
1a98816
fixing some errors
mglowacki100 Aug 7, 2023
9a6b6fd
fixing tests
mglowacki100 Aug 7, 2023
10b6753
Fixing tests - missing args
mglowacki100 Aug 7, 2023
28a926c
Update test_factory.py
mglowacki100 Aug 7, 2023
ccd4dd8
fixing tests
mglowacki100 Aug 7, 2023
45d2a59
Update test_datalab.py
mglowacki100 Aug 8, 2023
f998d13
Fixing imagelab
mglowacki100 Aug 8, 2023
a831ba7
Update datalab.py
mglowacki100 Aug 8, 2023
f3248c1
Update imagelab.py
mglowacki100 Aug 8, 2023
f9b4b33
Update issue_manager_factory.py
mglowacki100 Aug 8, 2023
2d615f8
Update issue_manager_factory.py
mglowacki100 Aug 8, 2023
bb254b4
Update issue_manager_factory.py
mglowacki100 Aug 8, 2023
c252f48
Update issue_manager_factory.py
mglowacki100 Aug 8, 2023
dcb921b
Update test_issue_manager.py
mglowacki100 Aug 8, 2023
ade4691
Update test_issue_manager.py
mglowacki100 Aug 8, 2023
c86802d
Draft docstrings
mglowacki100 Aug 8, 2023
137faa6
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Aug 9, 2023
ca78d5d
Update label.py
mglowacki100 Aug 10, 2023
118f19a
Update label.py
mglowacki100 Aug 10, 2023
273731c
Update label.py
mglowacki100 Aug 10, 2023
6b924e7
Update label.py
mglowacki100 Aug 10, 2023
6132cf5
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Aug 10, 2023
1107365
Merge branch 'dl_regression' of https://github.com/mglowacki100/clean…
mglowacki100 Aug 10, 2023
0467000
datalab regression
mglowacki100 Aug 17, 2023
a29145b
DataIssues with strategy pattern
mglowacki100 Aug 27, 2023
36b96de
black format and quick fix
mglowacki100 Aug 27, 2023
5cdfb72
Update test_data_issues.py
mglowacki100 Aug 27, 2023
92bbbd6
Update test_data_issues.py
mglowacki100 Aug 27, 2023
29a5b29
Update data_issues.py
mglowacki100 Aug 27, 2023
a2ca5ad
Update datalab.py
mglowacki100 Aug 27, 2023
f183067
default, possible issues test patch
mglowacki100 Aug 27, 2023
eae87b1
Fixing tests for default, possible issues
mglowacki100 Aug 27, 2023
1177c3c
Update data_issues.py
mglowacki100 Aug 27, 2023
8684932
- streamlining
mglowacki100 Aug 27, 2023
bf989f0
streamlining dataissues (type checks)
mglowacki100 Aug 27, 2023
bad9f25
Regression and dataissues type fixes
mglowacki100 Aug 28, 2023
6c6fbed
Minor fixes
mglowacki100 Aug 28, 2023
9baadbe
Update data_issues.py
mglowacki100 Aug 28, 2023
ba3d1f9
Update data_issues.py
mglowacki100 Aug 28, 2023
a049d4b
Update data_issues.py
mglowacki100 Aug 28, 2023
048725f
Update data_issues.py
mglowacki100 Aug 28, 2023
1348fa3
Update data_issues.py
mglowacki100 Aug 29, 2023
2a239a3
Update data_issues.py
mglowacki100 Aug 29, 2023
0b6d546
Update imagelab.py
mglowacki100 Aug 31, 2023
01257a7
code coverage - draft
mglowacki100 Sep 1, 2023
6d3ae7d
Update test_datalab.py
mglowacki100 Sep 1, 2023
cb68af3
Update test_issue_finder.py
mglowacki100 Sep 1, 2023
c766e1c
Update test_issue_finder.py
mglowacki100 Sep 1, 2023
8ac6329
Update test_datalab.py
mglowacki100 Sep 1, 2023
d95c79d
Update test_datalab.py
mglowacki100 Sep 1, 2023
241a313
Update test_datalab.py
mglowacki100 Sep 1, 2023
82e45a4
Update test_datalab.py
mglowacki100 Sep 1, 2023
ebb8299
Update test_datalab.py
mglowacki100 Sep 1, 2023
e008304
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Sep 1, 2023
fe55c63
Update datalab.py
mglowacki100 Sep 4, 2023
cb8c756
Merge branch 'dl_regression' of https://github.com/mglowacki100/clean…
mglowacki100 Sep 4, 2023
ab40911
fixing type issues
mglowacki100 Sep 4, 2023
d56bc20
typing issues
mglowacki100 Sep 4, 2023
dcae2fa
fixing type issues
mglowacki100 Sep 4, 2023
7a8d582
Update data_issues.py
mglowacki100 Sep 5, 2023
f3c33e6
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Sep 5, 2023
0d38917
Update datalab.py
mglowacki100 Sep 6, 2023
d25e39c
Merge branch 'dl_regression' of https://github.com/mglowacki100/clean…
mglowacki100 Sep 6, 2023
76e546b
Update datalab.py
mglowacki100 Sep 7, 2023
49635a7
Update label.py
mglowacki100 Sep 7, 2023
20c455b
_DataIssuesBuilder into the helper_factory.py
mglowacki100 Sep 7, 2023
77b3ad7
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Sep 7, 2023
651c2e6
refactoring issueFinder
mglowacki100 Sep 7, 2023
e33415f
Merge branch 'dl_regression' of https://github.com/mglowacki100/clean…
mglowacki100 Sep 7, 2023
339f7e8
refactoring IssueFinder hotfix
mglowacki100 Sep 7, 2023
0fa9997
Update datalab.py
mglowacki100 Sep 8, 2023
44c974b
Fixing two tests - "custom issue"
mglowacki100 Sep 8, 2023
b137c4b
Fixing issue with tests
mglowacki100 Sep 8, 2023
c69f9eb
'custom_issue' removal
mglowacki100 Sep 8, 2023
9e26187
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Sep 22, 2023
52d4ee2
Update cleanlab/datalab/internal/data_issues.py
mglowacki100 Oct 15, 2023
5bb6d3d
Update cleanlab/datalab/internal/data_issues.py
mglowacki100 Oct 15, 2023
290d19e
Update cleanlab/datalab/internal/issue_manager/regression/label.py
mglowacki100 Oct 15, 2023
e91fbc9
Update cleanlab/datalab/internal/data_issues.py
mglowacki100 Oct 15, 2023
7448142
Update cleanlab/datalab/internal/data_issues.py
mglowacki100 Oct 15, 2023
3057e14
method chaining
mglowacki100 Oct 15, 2023
5401665
Update issue_finder.py
mglowacki100 Oct 15, 2023
5de3fbc
Merge branch 'cleanlab:master' into dl_regression
mglowacki100 Oct 15, 2023
91359e9
Update datalab.py
mglowacki100 Oct 15, 2023
e5c12e3
Merge branch 'dl_regression' of https://github.com/mglowacki100/clean…
mglowacki100 Oct 15, 2023
a27f8ea
Update datalab.py
mglowacki100 Oct 15, 2023
0b9867f
inject task into IssueFinder
elisno Nov 14, 2023
5bbf46e
make different strategies for getting available issue types
elisno Nov 14, 2023
9d1a8e4
apply black formatter
elisno Nov 16, 2023
cf364c6
avoid mapping labels column for regression in Datalab
elisno Nov 18, 2023
3540d62
Pass in features to LabelIssueManager for regression
elisno Nov 20, 2023
585d14a
Use single REGISTRY
elisno Nov 20, 2023
3a77bcd
explicitly test get_available_issue_types for classification tasks
elisno Nov 20, 2023
a16633a
apply black formatter
elisno Nov 20, 2023
38e0901
remove class_imbalance from defaults
elisno Nov 20, 2023
4c6faae
minor updates to misc files
elisno Nov 20, 2023
b467478
Merge branch 'master' into pr/mglowacki100/796
elisno Nov 20, 2023
204651c
address missing imports for type-checking
elisno Nov 20, 2023
Use single REGISTRY
The register method and issue manager factories will be more maintainable if every issue manager is registered to a task.

For now, only label issue checks work for regression, as that is the only issue manager registered for regression.
elisno committed Nov 20, 2023
commit 585d14a9e16df8bbad5268861192cd90f2f1a587
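The single-registry design described in the commit message can be sketched in isolation as follows. This is a minimal illustration only: the `IssueManager` base class and the three manager classes here are empty stand-ins (assumptions, not cleanlab's real classes), and `from_str` is written as a plain function rather than the factory classmethod shown in the diff.

```python
from typing import Dict, List, Type


class IssueManager:
    """Stand-in base class (hypothetical; not cleanlab's real IssueManager)."""

    issue_name: str = ""


class LabelIssueManager(IssueManager):
    issue_name = "label"


class OutlierIssueManager(IssueManager):
    issue_name = "outlier"


class RegressionLabelIssueManager(IssueManager):
    issue_name = "label"


# One registry, keyed first by task and then by issue type.
REGISTRY: Dict[str, Dict[str, Type[IssueManager]]] = {
    "classification": {
        "label": LabelIssueManager,
        "outlier": OutlierIssueManager,
    },
    "regression": {"label": RegressionLabelIssueManager},
}


def from_str(issue_type: str, task: str) -> Type[IssueManager]:
    """Look up the issue manager class for an issue type under a task."""
    if task not in REGISTRY:
        raise ValueError(f"Invalid task type: {task}, must be in {list(REGISTRY.keys())}")
    if issue_type not in REGISTRY[task]:
        raise ValueError(f"Invalid issue type: {issue_type} for task {task}")
    return REGISTRY[task][issue_type]


def list_possible_issue_types(task: str) -> List[str]:
    """Issue types registered for a task; an unknown task yields an empty list."""
    return list(REGISTRY.get(task, []))
```

With a single nested mapping, every lookup goes through the task key, so an issue type that is only registered for classification (here `"outlier"`) is rejected for regression instead of silently falling back to a task-agnostic entry.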
43 changes: 15 additions & 28 deletions cleanlab/datalab/internal/issue_manager_factory.py
@@ -50,16 +50,14 @@
 )
 from cleanlab.datalab.internal.issue_manager.regression import RegressionLabelIssueManager

-REGISTRY: Dict[str, Type[IssueManager]] = {
-    "outlier": OutlierIssueManager,
-    "near_duplicate": NearDuplicateIssueManager,
-    "non_iid": NonIIDIssueManager,
-}

-TASK_SPECIFIC_REGISTRY: Dict[str, Dict[str, Type[IssueManager]]] = {
+REGISTRY: Dict[str, Dict[str, Type[IssueManager]]] = {
     "classification": {
         "label": LabelIssueManager,
         "class_imbalance": ClassImbalanceIssueManager,
+        "outlier": OutlierIssueManager,
+        "near_duplicate": NearDuplicateIssueManager,
+        "non_iid": NonIIDIssueManager,
     },
     "regression": {
         "label": RegressionLabelIssueManager,
@@ -94,27 +92,22 @@ def from_str(cls, issue_type: str, task: str) -> Type[IssueManager]:
                 "issue_type must be a string, not a list. Try using from_list instead."
             )

-        if task is None and issue_type not in REGISTRY:
-            raise ValueError(f"Invalid issue type: {issue_type}")
-        if task not in TASK_SPECIFIC_REGISTRY:
+        if task not in REGISTRY:
             raise ValueError(
-                f"Invalid task type: {task}, must be in {list(TASK_SPECIFIC_REGISTRY.keys())}"
+                f"Invalid task type: {task}, must be in {list(REGISTRY.keys())}"
             )
-        if issue_type not in TASK_SPECIFIC_REGISTRY[task] and issue_type not in REGISTRY:
+        if issue_type not in REGISTRY[task]:
             raise ValueError(f"Invalid issue type: {issue_type} for task {task}")

-        if issue_type in TASK_SPECIFIC_REGISTRY[task]:
-            return TASK_SPECIFIC_REGISTRY[task][issue_type]
-
-        return REGISTRY[issue_type]
+        return REGISTRY[task][issue_type]

     @classmethod
     def from_list(cls, issue_types: List[str], task: str) -> List[Type[IssueManager]]:
         """Constructs a list of concrete issue manager classes from a list of strings."""
         return [cls.from_str(issue_type, task) for issue_type in issue_types]


-def register(cls: Type[IssueManager], task: str) -> Type[IssueManager]:
+def register(cls: Type[IssueManager], task: str="classification") -> Type[IssueManager]:
     """Registers the issue manager factory.

     Parameters
@@ -171,24 +164,18 @@ def find_issues(self, **kwargs):

     name: str = str(cls.issue_name)

-    if task is not None and task not in TASK_SPECIFIC_REGISTRY:
+    if task not in REGISTRY:
         raise ValueError(
-            f"Invalid task type: {task}, must be in {list(TASK_SPECIFIC_REGISTRY.keys())}"
+            f"Invalid task type: {task}, must be in {list(REGISTRY.keys())}"
         )

-    if task is not None and name in TASK_SPECIFIC_REGISTRY[task]:
+    if name in REGISTRY[task]:
         print(
             f"Warning: Overwriting existing issue manager {name} with {cls} for task {task}."
             "This may cause unexpected behavior."
         )

-    if task is None and name in REGISTRY:
-        print(
-            f"Warning: Overwriting existing issue manager {name} with {cls}."
-            "This may cause unexpected behavior."
-        )
-
-    REGISTRY[name] = cls
+    REGISTRY[task][name] = cls
     return cls


@@ -201,7 +188,7 @@ def list_possible_issue_types(task: str) -> List[str]:
     --------
     :py:class:`REGISTRY <cleanlab.datalab.internal.issue_manager_factory.REGISTRY>` : All available issue types and their corresponding issue managers can be found here.
     """
-    return list(REGISTRY.keys()) + list(TASK_SPECIFIC_REGISTRY.get(task, []))
+    return list(REGISTRY.get(task, []))


 def list_default_issue_types(task: str) -> List[str]:
@@ -213,7 +200,7 @@ def list_default_issue_types(task: str) -> List[str]:
     :py:class:`REGISTRY <cleanlab.datalab.internal.issue_manager_factory.REGISTRY>` : All available issue types and their corresponding issue managers can be found here.
     """
     if task == "regression":
-        default_issue_types = ["label", "outlier", "near_duplicate", "non_iid"]
+        default_issue_types = ["label"]
     else:
         default_issue_types = [
             "label",
11 changes: 8 additions & 3 deletions tests/datalab/test_datalab.py
@@ -624,7 +624,7 @@ def test_custom_issue_manager_registered(self, lab, custom_issue_manager):
         """Test that a custom issue manager that is registered will be used."""
         from cleanlab.datalab.internal.issue_manager_factory import register

-        register(custom_issue_manager, task=None)
+        register(custom_issue_manager)

         assert lab.issues.empty
         assert lab.issue_summary.empty
@@ -648,7 +648,7 @@ def test_find_issues_for_custom_issue_manager_with_custom_kwarg(
         """Test that a custom issue manager that is registered will be used."""
         from cleanlab.datalab.internal.issue_manager_factory import register

-        register(custom_issue_manager, task=None)
+        register(custom_issue_manager)

         assert lab.issues.empty
         assert lab.issue_summary.empty
@@ -668,7 +668,8 @@ def test_find_issues_for_custom_issue_manager_with_custom_kwarg(
         # Clean up registry
         from cleanlab.datalab.internal.issue_manager_factory import REGISTRY

-        REGISTRY.pop(custom_issue_manager.issue_name)
+        # Find the custom issue manager in the registry and remove it
+        REGISTRY["classification"].pop(custom_issue_manager.issue_name)


 @pytest.mark.parametrize(
@@ -910,6 +911,10 @@ def test_regression():
     test_df = pd.DataFrame(X, columns=["c1", "c2", "c3"])
     test_df["y"] = y
     lab = Datalab(data=test_df, label_name="y", task="regression")
+
+    assert set(lab.list_default_issue_types()) == set(["label"])
+    assert set(lab.list_possible_issue_types()) == set(["label"])
+
     lab.find_issues(features=X)
     lab.report()
     summary = lab.get_issue_summary()
4 changes: 2 additions & 2 deletions tests/datalab/test_factory.py
@@ -28,12 +28,12 @@ def test_list_possible_issue_types(registry):
     class TestIssueManager(IssueManager):
         issue_name = test_key

-    TestIssueManager = register(TestIssueManager, task=None)
+    TestIssueManager = register(TestIssueManager)

     issue_types = lab.list_possible_issue_types()
[Review comment (Member) on the line above: Reminder to self: Revisit this line after the issue type listing methods have been refactored.]
     assert set(issue_types) == set(
         possible_issues + [test_key]
     ), "New issue type should be added to the list"

     # Clean up
-    del registry[test_key]
+    del registry["classification"][test_key]
3 changes: 2 additions & 1 deletion tests/datalab/test_issue_finder.py
@@ -94,6 +94,7 @@ def test_get_available_issue_types(self, issue_finder):
             {"label": {"some_arg": "some_value"}, "outlier": {}},
             {},
         ]
+        supported_issue_types = ["label"]
         for issue_types in issue_types_dicts:
             available_issue_types = issue_finder.get_available_issue_types(issue_types=issue_types)
             fail_msg = f"Failed to get available issue types with issue_types={issue_types}"
@@ -103,4 +104,4 @@ def test_get_available_issue_types(self, issue_finder):
         kwargs = {k: k for k in ["pred_probs", "features", "knn_graph"]}
         kwargs["issue_types"] = {"label": {}}
         available_issue_types = issue_finder.get_available_issue_types(**kwargs)
-        assert available_issue_types == {"label": {}}
+        assert available_issue_types == {"label": {"features": "features"}}
30 changes: 21 additions & 9 deletions tests/datalab/test_issue_manager.py
@@ -48,10 +48,9 @@ class Foo(IssueManager):
         def find_issues(self):
             pass

-    Foo = register(Foo, task=None)
+    Foo = register(Foo)

-    assert "foo" in REGISTRY
-    assert REGISTRY["foo"] == Foo
+    assert REGISTRY["classification"].get("foo") == Foo

     # Reregistering should overwrite the existing class, put print a warning

@@ -63,14 +62,17 @@ class NewFoo(IssueManager):
         def find_issues(self):
             pass

-    NewFoo = register(NewFoo, task=None)
+    NewFoo = register(NewFoo)

-    assert "foo" in REGISTRY
-    assert REGISTRY["foo"] == NewFoo
+    assert REGISTRY["classification"].get("foo") == NewFoo
     assert all(
         [
             text in sys.stdout.getvalue()
-            for text in ["Warning: Overwriting existing issue manager foo with ", "NewFoo"]
+            for text in [
+                "Warning: Overwriting existing issue manager foo with ",
+                "NewFoo",
+                " for task classification.",
+            ]
         ]
     ), "Should print a warning"

@@ -83,8 +85,7 @@ def find_issues(self):

     NewerFoo = register(NewerFoo, task="classification")

-    assert "label" in REGISTRY
-    assert REGISTRY["label"] == NewerFoo
+    assert REGISTRY["classification"].get("label") == NewerFoo
     assert all(
         [
             text in sys.stdout.getvalue()
@@ -95,3 +96,14 @@ def find_issues(self):
             ]
         ]
     ), "Should print a warning"
+
+    # Registering any issue manager for another task is permitted
+    class Bar(IssueManager):
+        issue_name = "bar"
+
+        def find_issues(self):
+            pass
+
+    Bar = register(Bar, task="regression")
+
+    assert REGISTRY["regression"].get("bar") == Bar
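The registration behaviour these tests exercise (default task, overwrite warning, per-task isolation) can be reproduced without cleanlab installed. The sketch below is an assumption-laden stand-in: `IssueManager`, `REGISTRY`, and `register` are re-declared locally to mirror the diff above, with an initially empty registry, and the printed warning is captured with `contextlib.redirect_stdout` rather than the tests' own stdout fixture.

```python
import contextlib
import io
from typing import Dict, Type


class IssueManager:
    """Stand-in base class (hypothetical; not cleanlab's real IssueManager)."""

    issue_name: str = ""


# Local stand-in registry, empty for both tasks (cleanlab's is pre-populated).
REGISTRY: Dict[str, Dict[str, Type[IssueManager]]] = {
    "classification": {},
    "regression": {},
}


def register(cls: Type[IssueManager], task: str = "classification") -> Type[IssueManager]:
    """Register an issue manager under a task, warning when a name is overwritten."""
    name = str(cls.issue_name)
    if task not in REGISTRY:
        raise ValueError(f"Invalid task type: {task}, must be in {list(REGISTRY.keys())}")
    if name in REGISTRY[task]:
        print(
            f"Warning: Overwriting existing issue manager {name} with {cls} for task {task}."
            "This may cause unexpected behavior."
        )
    REGISTRY[task][name] = cls
    return cls


class Foo(IssueManager):
    issue_name = "foo"


class NewFoo(IssueManager):
    issue_name = "foo"  # Same name as Foo, so registering it overwrites Foo.


class Bar(IssueManager):
    issue_name = "bar"


register(Foo)  # Falls back to the default task, "classification".

# Capture the warning printed when the same name is registered again.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    register(NewFoo)
warning = buffer.getvalue()

register(Bar, task="regression")  # Lands only in the regression sub-registry.
```

Because each task has its own sub-dictionary, registering `Bar` for regression leaves the classification entries untouched, which is why the tests clean up with `REGISTRY["classification"].pop(...)` rather than popping from the top level.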