Tubular Tutorial: Dynamically handle displaying exact duplicate sets #1128

Merged: 4 commits, May 24, 2024

nelsonauner
Contributor

Summary

🎯 Purpose: Describe the objective of your changes in this Pull-Request.

This simple change improves the tutorial by making sure the referenced row (Student 690) is actually displayed. Otherwise the output can be very confusing. For example, when I ran the tutorial:

image

A quick change to the sort fixes this. Using mergesort is important because it is a stable sort (in pandas, `mergesort` and `stable` are the only stable `kind` options), which keeps the duplicate-score ordering intact when scores are tied. See https://stackoverflow.com/questions/33699555/pandas-sorting-by-value-and-then-by-index
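A minimal sketch of why the stable sort matters, using a made-up table rather than the tutorial's data. The column name `near_duplicate_score` and the index values are assumptions for illustration only. Sorting by index first and then by score with a stable algorithm keeps tied rows in index order, so the displayed rows no longer depend on how the sort breaks ties.

```python
import pandas as pd

# Hypothetical issue table: several rows share the same score (ties).
issues = pd.DataFrame(
    {"near_duplicate_score": [0.0, 0.5, 0.0, 0.0, 0.5]},
    index=[690, 12, 246, 3, 7],
)

# Order by index first, then sort by score with a stable algorithm.
# A stable sort preserves the existing (index) order among tied scores.
ordered = issues.sort_index().sort_values(
    "near_duplicate_score", kind="mergesort"
)

print(list(ordered.index))
```

With the default `kind="quicksort"`, pandas does not guarantee the relative order of the tied rows.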

Here are the results. Student 690 isn't the first row, so I also updated the copy.

image

@nelsonauner nelsonauner force-pushed the tutorial-fix-sorting-duplicates branch from a960fab to 1712bdd Compare May 24, 2024 14:22
@nelsonauner nelsonauner changed the title Sort duplicate results before displaying Tubular Tutorial: Sort duplicate results before displaying May 24, 2024
@elisno
Member

elisno commented May 24, 2024

Hi @nelsonauner!

Thank you for pointing out this issue! Good catch on the unstable sorting algorithm.
It turns out there are so many sets of exact duplicates in this tutorial that the displayed results vary.

I've made some changes to ensure a more robust and clear approach.

  • In the tutorial, we display the lowest-scoring example (based on the near-duplicate score) and its associated near-duplicate sets. I don't see a reason for us to hard-code which examples to display.

  • I also updated the relevant text to be agnostic to specific hardcoded IDs (690, 246, etc.).

Finally, I added tests at the end of the notebook to ensure that all displayed rows are identical (since we state they are exact duplicates).
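The approach described above could look roughly like the sketch below. It is not the notebook's actual code: the data, the `issues` table, and the column names `near_duplicate_score` and `near_duplicate_sets` are stand-ins assumed for illustration. It picks the lowest-scoring example dynamically, gathers its duplicate set, and checks that every displayed row is identical.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's dataset.
data = pd.DataFrame(
    {
        "exam_1": [90, 75, 90, 60, 90],
        "notes": ["great", "late", "great", "late", "great"],
    }
)

# Hypothetical per-row issue summary (assumed column names).
issues = pd.DataFrame(
    {
        "near_duplicate_score": [0.0, 0.8, 0.0, 0.9, 0.0],
        "near_duplicate_sets": [[2, 4], [], [0, 4], [], [0, 2]],
    }
)

# Pick the lowest-scoring example instead of hard-coding an ID.
lowest = int(issues["near_duplicate_score"].idxmin())

# Display that example together with the rows in its duplicate set.
rows_to_show = [lowest, *issues.at[lowest, "near_duplicate_sets"]]
displayed = data.loc[rows_to_show]

# Sanity check: exact duplicates collapse to a single unique row.
assert len(displayed.drop_duplicates()) == 1
```

Because the row is chosen from the scores rather than by a fixed ID, the surrounding text does not need to name specific students.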


The ideal situation would be if we were able to add more interactivity to the docs, like an interactive widget:

Upptaka.2024-05-24.210926.mp4

Unfortunately, this isn't so straightforward when we build the docs with these notebooks.

@elisno elisno changed the title Tubular Tutorial: Sort duplicate results before displaying Tubular Tutorial: Dynamically handle displaying exact duplicate sets May 24, 2024
@elisno elisno merged commit 51b147b into cleanlab:master May 24, 2024
19 checks passed