Add MLBF validation logic and corresponding management command #22983

Open · wants to merge 7 commits into master
Conversation

KevinMind (Contributor) commented Jan 13, 2025

Fixes: mozilla/addons#15281

Description

  • Implemented a `validate` method in the `MLBF` class to check for duplicate items in `cache.json`, ensuring each item occurs only once and in a single data type (a rough sketch follows this list).
  • Added a new management command, `validate_mlbf`, to facilitate validation of MLBF directories from the command line.
  • Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.
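For illustration, a rough sketch of the shape of such a check. This is not the PR's code: the key names ("blocked", etc.) and the error type are assumptions.

```python
import json
from collections import Counter

def validate(cache_path):
    """Ensure every item in cache.json occurs exactly once,
    within and across data types."""
    with open(cache_path) as f:
        # Assumed layout: {"blocked": [...], "soft_blocked": [...], "not_blocked": [...]}
        data = json.load(f)

    seen = {}  # item -> the data type it was first seen in
    for data_type, items in data.items():
        # Duplicates within a single data type
        duplicates = [item for item, count in Counter(items).items() if count > 1]
        if duplicates:
            raise ValueError(f'{data_type} contains duplicate items: {duplicates[:5]}')
        # Duplicates across data types
        for item in items:
            if item in seen:
                raise ValueError(f'{item} occurs in both {seen[item]} and {data_type}')
            seen[item] = data_type
```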

Context

We have discovered that MLBFs created since November 2024 are larger than expected. This validation is one step toward ensuring we do not see unchecked growth in the MLBF, by verifying the underlying data set used to produce the filter.

Testing

  • Download the cache.json from the (privately) linked g-drive URL
  • Move the mlbf directory into your "storage" directory in the repository
  • Run the command below to validate the data

./manage.py validate_mlbf <id>

Where `<id>` is the timestamp directory name in `./mlbf/<id>/`.
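For reference, a rough sketch of what such a management command might look like. The import path, the `load_from_storage` loader, and the argument name are assumptions; the PR's actual command is authoritative.

```python
from django.core.management.base import BaseCommand, CommandError

from olympia.blocklist.mlbf import MLBF  # import path assumed


class Command(BaseCommand):
    help = 'Validate the cache.json of an existing MLBF directory.'

    def add_arguments(self, parser):
        parser.add_argument('id', help='Timestamp directory name under ./mlbf/')

    def handle(self, *args, **options):
        # Loader name is an assumption about the MLBF class API.
        mlbf = MLBF.load_from_storage(options['id'])
        if mlbf is None:
            raise CommandError(f'No MLBF found for id {options["id"]}')
        mlbf.validate()  # raises on duplicate items
        self.stdout.write(self.style.SUCCESS('MLBF data is valid.'))
```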

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

KevinMind (Contributor Author) commented Jan 13, 2025

Verified on production data that we do in fact have duplicates and the verification command catches them.


Now to find out how that happened in the first place and how to prevent it.

FYI @Rob--W your theory seems to be correct.

Rob--W (Member) commented Jan 13, 2025

> Now to find out how that happened in the first place and how to prevent it.
>
> FYI @Rob--W your theory seems to be correct.

Note: although cache.json here does indeed have duplicates (which I had observed myself as well), it is not necessarily the root cause of the issues observed in #15261.

In my local testing that led to mozilla/addons#15261 (comment), I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

KevinMind requested review from a team and eviljeff, and removed the request for a team, January 13, 2025 15:03
KevinMind (Contributor Author) commented Jan 13, 2025

> I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

I'm not sure I follow this logic.

> I generated a new MLBF solely based on the cache.json file, without de-duplication.

What do you mean by that? Like, what code did you actually run?

> The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

That makes sense, assuming the way you generated the filter is the same as the way production generated it; my understanding of that depends on the answer to the first question. But even if that were the case, if you did not de-duplicate the file and still generated a smaller filter, how did that happen? Isn't it even more odd if we get different-sized filters from the same data set in different environments? I must be missing something from your comment.

Rob--W (Member) commented Jan 13, 2025

> > I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.
>
> I'm not sure I follow this logic.

The production file size is 931343, as linked from https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5. I generated a new MLBF from cache.json and got a smaller file. Conclusion: the actual MLBF generation on AMO prod includes more data than what is written to cache.json.

> > I generated a new MLBF solely based on the cache.json file, without de-duplication.
>
> What do you mean by that? Like, what code did you actually run?

This is the code using filtercascade, similar to what `generate_mlbf` in `mlbf.py` does: https://gist.github.com/Rob--W/39940bf6266d5e5b2c3b7e65da67696b

I've attached the command output to https://mozilla-hub.atlassian.net/browse/AMOENG-1332?focusedCommentId=988314
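For readers without access to the gist, a minimal sketch of that kind of standalone script, assuming the filtercascade API (`FilterCascade`, `initialize`, `verify`, `tofile`) as used by `generate_mlbf`. The key names in cache.json are guesses; the gist is the authoritative version.

```python
import json
import os

from filtercascade import FilterCascade

# Assumed cache.json layout; the real key names may differ.
with open('cache.json') as f:
    data = json.load(f)

include = data['blocked']           # items the filter must match
exclude = set(data['not_blocked'])  # items the filter must not match

cascade = FilterCascade()
cascade.initialize(include=include, exclude=exclude)
cascade.verify(include=include, exclude=exclude)  # sanity-check the cascade

with open('filter.bin', 'wb') as f:
    cascade.tofile(f)

print('filter size (bytes):', os.path.getsize('filter.bin'))
```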

KevinMind requested a review from Rob--W, January 13, 2025 19:39
Rob--W (Member) left a comment

Thanks for putting this fix up so quickly!

The fix for the primary issue looks good to me.

I'll defer to AMO engineers for the approving sign-off, in case they have thoughts on the tests or implementation details.

)

reused_addon.update(guid=GUID_REUSE_FORMAT.format(addon.id))
reused_addon.addonguid.update(guid=addon.guid)
Member:

Add a sanity check that this does indeed generate duplicate entries? Otherwise the test may pass trivially.

Contributor Author:

I'm not sure how I would do that beyond what is already here. I'm explicitly setting the `addonguid.guid` of `reused_addon` to the guid of `addon`. Neither is blocked, so I cannot check for two non-existent blocks, so I'm scratching my head on what exactly to verify.

deduped_versions = []
unique_versions = set()

for version in versions:
    if version not in unique_versions:
        # Drop duplicates while preserving first-seen order.
        unique_versions.add(version)
        deduped_versions.append(version)
Member:

Question: is there a benefit to looping over the input and checking each entry, over using `list(set(versions))`?

Contributor Author:

Converting to a set and back to a list does not preserve order in Python, and it is not faster, so there is no real benefit beyond code prettiness.
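Worth noting as an aside: Python does have an order-preserving dedup idiom, since `dict` preserves insertion order (guaranteed since Python 3.7).

```python
versions = ['b', 'a', 'b', 'c', 'a']

# set() gives no ordering guarantee:
print(list(set(versions)))            # e.g. ['c', 'a', 'b']

# dict.fromkeys dedupes while keeping first-seen order:
print(list(dict.fromkeys(versions)))  # ['b', 'a', 'c']
```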

Member:

I've suggested this in other PRs too, but we could just keep everything as sets, and then `sorted(...)` before writing to the JSON file. Python has efficient set handling (and the syntax is prettier too).
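A minimal sketch of that suggestion; the variable names and JSON layout are illustrative, not the PR's code.

```python
import json

# Keep working data as sets for O(1) membership tests and cheap set algebra.
blocked = {'guid-b:2.0', 'guid-a:1.0'}
not_blocked = {'guid-c:1.0', 'guid-a:1.0'} - blocked  # set difference removes overlap

# Sort only at serialization time, so the JSON output stays deterministic.
with open('cache.json', 'w') as f:
    json.dump(
        {'blocked': sorted(blocked), 'not_blocked': sorted(not_blocked)},
        f,
        indent=2,
    )
```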

@@ -77,7 +77,7 @@ def generate_mlbf(stats, include, exclude):
# Extends the BlockType enum to include versions that have no block of any type
MLBFDataType = Enum(
'MLBFDataType',
[block_type.name for block_type in BlockType] + ['NOT_BLOCKED', 'not_blocked'],
Member:

This is the primary fix for the cause of mozilla/addons#15261 (comment).
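For context, a standalone demonstration of the Enum pitfall involved here. The `(name, value)` pairing shown as the corrected form is an inference from the discussion, not a quote of the PR's diff.

```python
from enum import Enum

# A flat list of strings creates one member per string, so 'NOT_BLOCKED'
# and 'not_blocked' become two distinct data types:
Flat = Enum('Flat', ['NOT_BLOCKED', 'not_blocked'])
print(list(Flat))    # [<Flat.NOT_BLOCKED: 1>, <Flat.not_blocked: 2>]

# A (name, value) pair creates a single member whose value is the string:
Paired = Enum('Paired', [('NOT_BLOCKED', 'not_blocked')])
print(list(Paired))  # [<Paired.NOT_BLOCKED: 'not_blocked'>]
```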

KevinMind marked this pull request as ready for review, January 14, 2025 13:41
Development

Successfully merging this pull request may close this issue: Add ability to verify no duplicate entries on MLBF cache.json
3 participants