Add MLBF validation logic and corresponding management command #22983

Open · wants to merge 7 commits into master
Conversation

KevinMind (Contributor) commented Jan 13, 2025

Fixes: mozilla/addons#15281

Description

  • Implemented a `validate` method in the `MLBF` class to check for duplicate items in `cache.json`, ensuring each item occurs only once and in a single data type (a rough sketch follows this list).
  • Added a new management command, `validate_mlbf`, to facilitate validation of MLBF directories from the command line.
  • Created unit tests to verify the validation logic, ensuring it raises errors for duplicate items within a single data type and across multiple data types.
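For illustration, a rough sketch of the shape of such a check. This is not the PR's code: the key names ("blocked", etc.) and the error type are assumptions.

```python
import json
from collections import Counter

def validate(cache_path):
    """Ensure every item in cache.json occurs exactly once,
    within and across data types."""
    with open(cache_path) as f:
        # Assumed layout: {"blocked": [...], "soft_blocked": [...], "not_blocked": [...]}
        data = json.load(f)

    seen = {}  # item -> the data type it was first seen in
    for data_type, items in data.items():
        # Duplicates within a single data type
        duplicates = [item for item, count in Counter(items).items() if count > 1]
        if duplicates:
            raise ValueError(f'{data_type} contains duplicate items: {duplicates[:5]}')
        # Duplicates across data types
        for item in items:
            if item in seen:
                raise ValueError(f'{item} occurs in both {seen[item]} and {data_type}')
            seen[item] = data_type
```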

Context

We have discovered that MLBFs created since November 2024 are larger than expected. This validation is one step toward ensuring we do not see unchecked growth in the MLBF, by verifying the underlying data set used to produce the filter.

Testing

  • Download the cache.json from the (privately) linked g-drive URL
  • Move the mlbf directory into your "storage" directory in the repository
  • Run the command below to validate the data

./manage.py validate_mlbf <id>

Where `<id>` is the timestamp directory name in `./mlbf/<id>/`.
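For reference, a rough sketch of what such a management command might look like. The import path, the `load_from_storage` loader, and the argument name are assumptions; the PR's actual command is authoritative.

```python
from django.core.management.base import BaseCommand, CommandError

from olympia.blocklist.mlbf import MLBF  # import path assumed


class Command(BaseCommand):
    help = 'Validate the cache.json of an existing MLBF directory.'

    def add_arguments(self, parser):
        parser.add_argument('id', help='Timestamp directory name under ./mlbf/')

    def handle(self, *args, **options):
        # Loader name is an assumption about the MLBF class API.
        mlbf = MLBF.load_from_storage(options['id'])
        if mlbf is None:
            raise CommandError(f'No MLBF found for id {options["id"]}')
        mlbf.validate()  # raises on duplicate items
        self.stdout.write(self.style.SUCCESS('MLBF data is valid.'))
```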

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

KevinMind (Contributor Author) commented Jan 13, 2025

Verified on production data that we do in fact have duplicates and the verification command catches them.


Now to find out how that happened in the first place and how to prevent it.

FYI @Rob--W your theory seems to be correct.

Rob--W (Member) commented Jan 13, 2025

> Now to find out how that happened in the first place and how to prevent it.
>
> FYI @Rob--W your theory seems to be correct.

Note: although cache.json here does indeed have duplicates (which I had observed myself as well), it is not necessarily the root cause of the issues observed in #15261.

In my local testing that led to mozilla/addons#15261 (comment), I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

KevinMind requested review from a team and eviljeff, and removed the request for a team, January 13, 2025 15:03
KevinMind (Contributor Author) commented Jan 13, 2025

> I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

I'm not sure I follow this logic.

> I generated a new MLBF solely based on the cache.json file, without de-duplication.

What do you mean by that? Like, what code did you actually run?

> The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.

That makes sense, assuming the way you generated the filter is the same as the way production generated it; my understanding of that depends on the answer to the first question. But even if that were the case, if you did not de-duplicate the file and still generated a smaller filter, how did that happen? Isn't it even more odd if we get different-sized filters from the same data set in different environments? I must be missing something from your comment.

Rob--W (Member) commented Jan 13, 2025

> > I generated a new MLBF solely based on the cache.json file, without de-duplication. This means that any duplicates in cache.json would directly contribute to the generated MLBF. The fact that my MLBF is still smaller than what was observed in production is a strong signal that there is additional duplicate data that is not stored in cache.json.
>
> I'm not sure I follow this logic.

The production file size is 931343, as linked from https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5. I generated a new MLBF from cache.json and got a smaller file. Conclusion: the actual MLBF generation on AMO prod includes more data than what is written to cache.json.

> > I generated a new MLBF solely based on the cache.json file, without de-duplication.
>
> What do you mean by that? Like, what code did you actually run?

This is the code using filtercascade, similar to what `generate_mlbf` in `mlbf.py` does: https://gist.github.com/Rob--W/39940bf6266d5e5b2c3b7e65da67696b

I've attached the command output to https://mozilla-hub.atlassian.net/browse/AMOENG-1332?focusedCommentId=988314
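For readers without access to the gist, a minimal sketch of that kind of standalone script, assuming the filtercascade API (`FilterCascade`, `initialize`, `verify`, `tofile`) as used by `generate_mlbf`. The key names in cache.json are guesses; the gist is the authoritative version.

```python
import json
import os

from filtercascade import FilterCascade

# Assumed cache.json layout; the real key names may differ.
with open('cache.json') as f:
    data = json.load(f)

include = data['blocked']           # items the filter must match
exclude = set(data['not_blocked'])  # items the filter must not match

cascade = FilterCascade()
cascade.initialize(include=include, exclude=exclude)
cascade.verify(include=include, exclude=exclude)  # sanity-check the cascade

with open('filter.bin', 'wb') as f:
    cascade.tofile(f)

print('filter size (bytes):', os.path.getsize('filter.bin'))
```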

KevinMind requested a review from Rob--W, January 13, 2025 19:39
Rob--W (Member) left a comment

Thanks for putting this fix up so quickly!

The fix for the primary issue looks good to me.

I'll defer to AMO engineers for the approving sign-off, in case they have thoughts on the tests or implementation details.

)

reused_addon.update(guid=GUID_REUSE_FORMAT.format(addon.id))
reused_addon.addonguid.update(guid=addon.guid)
Member:

Add a sanity check that this does indeed generate duplicate entries? Otherwise the test may pass trivially.

Contributor Author:

I'm not sure how I would do that beyond what is already here. I'm explicitly setting the `addonguid.guid` of `reused_addon` to the guid of `addon`. Neither is blocked, so I cannot check for two non-existent blocks, so I'm scratching my head on what exactly to verify.

deduped_versions = []
unique_versions = set()

for version in versions:
    if version not in unique_versions:
        # Drop duplicates while preserving first-seen order.
        unique_versions.add(version)
        deduped_versions.append(version)
Member:

Question: is there a benefit to looping over the input and checking each entry, over using `list(set(versions))`?

Contributor Author:

Converting to a set and back to a list does not preserve order in Python, and it is not faster, so there is no real benefit beyond code prettiness.
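Worth noting as an aside: Python does have an order-preserving dedup idiom, since `dict` preserves insertion order (guaranteed since Python 3.7).

```python
versions = ['b', 'a', 'b', 'c', 'a']

# set() gives no ordering guarantee:
print(list(set(versions)))            # e.g. ['c', 'a', 'b']

# dict.fromkeys dedupes while keeping first-seen order:
print(list(dict.fromkeys(versions)))  # ['b', 'a', 'c']
```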

Member:

I've suggested this in other PRs too, but we could just keep everything as sets, and then `sorted(...)` before writing to the JSON file. Python has efficient set handling (and the syntax is prettier too).
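A minimal sketch of that suggestion; the variable names and JSON layout are illustrative, not the PR's code.

```python
import json

# Keep working data as sets for O(1) membership tests and cheap set algebra.
blocked = {'guid-b:2.0', 'guid-a:1.0'}
not_blocked = {'guid-c:1.0', 'guid-a:1.0'} - blocked  # set difference removes overlap

# Sort only at serialization time, so the JSON output stays deterministic.
with open('cache.json', 'w') as f:
    json.dump(
        {'blocked': sorted(blocked), 'not_blocked': sorted(not_blocked)},
        f,
        indent=2,
    )
```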

@@ -77,7 +77,7 @@ def generate_mlbf(stats, include, exclude):
# Extends the BlockType enum to include versions that have no block of any type
MLBFDataType = Enum(
'MLBFDataType',
[block_type.name for block_type in BlockType] + ['NOT_BLOCKED', 'not_blocked'],
Member:

This is the primary fix for the cause of mozilla/addons#15261 (comment).
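For context, a standalone demonstration of the Enum pitfall involved here. The `(name, value)` pairing shown as the corrected form is an inference from the discussion, not a quote of the PR's diff.

```python
from enum import Enum

# A flat list of strings creates one member per string, so 'NOT_BLOCKED'
# and 'not_blocked' become two distinct data types:
Flat = Enum('Flat', ['NOT_BLOCKED', 'not_blocked'])
print(list(Flat))    # [<Flat.NOT_BLOCKED: 1>, <Flat.not_blocked: 2>]

# A (name, value) pair creates a single member whose value is the string:
Paired = Enum('Paired', [('NOT_BLOCKED', 'not_blocked')])
print(list(Paired))  # [<Paired.NOT_BLOCKED: 'not_blocked'>]
```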

KevinMind marked this pull request as ready for review, January 14, 2025 13:41
Development

Successfully merging this pull request may close this issue: Add ability to verify no duplicate entries on MLBF cache.json
3 participants