-
-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New UI for listing object uses, including in RichText and StreamField. #4481
Conversation
e829b29
to
49bca1f
Compare
Hi @BertrandBordage, Prior to Wagtail 2.0, I haven't looked at your subsequent commits yet, but I guess you need an object to attach the "find entities of this type" method to. |
@gasman: Indeed, I did this because I needed a system to fetch models linked to an I also started bringing back elements together because either we want to create an extension for Hallo or Draftail, most things are the same: the related models, the way we fetch database objects from attributes, the way we select elements in the DOM, the name of the button, etc. To me, this is going exactly the right way because we want to make a registering system that devs can extend with a clear API. The day Hallo goes away, just one or two methods would have to be removed from the registering system. Today, sorry to say it roughly, but it’s messy and I had a hard figuring out lots of details, and some still don’t make sense to me, like the class name convention for lists and dicts like I didn’t put all Draftail And to conclude, I didn’t really have a choice: I couldn’t base my work on the Draftail extension classes because they don’t exist, and because data is stored as Hallo HTML, not as Draft.js JSON. I didn’t want to introduce a new registration as well, since we obviously don’t want a third system of registering extensions in addition of Hallo’s & Draftail’s ones. Devs would forget to use it in addition of the Draftail & Hallo systems, so the uses UI wouldn’t work well with custom rich text extensions. So how should it work? We can also use it as is and improve/change it later when we drop Hallo. I’m perfectly willing to do so. |
Have now done some refactoring in https://github.com/gasman/wagtail/commits/cleanup/richtext-converter-refactor, so that the I acknowledge that the rich text handling has some severe "bus factor" issues right now... there are some design decisions (e.g. my desire to keep the dbHTML -> frontendHTML and dbHTML <-> editorHTML logic separated) that came from being immersed in the rich text logic for several weeks, but might not be so obvious in the resulting code. We should certainly review (and document!) those decisions at some point... however, given that (I believe) it's a fairly tangential part of this PR, and I don't want to hold up the progress of this feature into 2.1, I think it's best to say "just trust me on this" for now :-) |
|
||
@staticmethod | ||
def get_model(): | ||
return Embed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's logically correct to treat EmbedHandler as being associated with the Embed
model - the Embed
model is more like a cache, and deleting records from that table has no consequence on rich text fields that reference that URL. (Looking at it more abstractly: there's no reason that a linktype/embedtype has to be associated with a model; it's just rewriting one kind of HTML tag into another.)
Happily, I don't think there's any negative consequence from simply returning None
here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this for more than just deleting. The idea is to have a way, through collectors, to find all references of a single object. This gives a lot of possibilities, including the possibility to create a task to clean the embeds from time to time. If a reference to an embed is removed, the object still sits there in the database, useless. On huge sites, that’ll be good to have a command to remove useless embeds from time to time, hence the system to find references for it in rich text & stream fields.
Indeed, linktype/embedtypes can do something without a model, but this is not the case in all the linktype/embedtypes provided by Wagtail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're using it for garbage-collecting the embeds table, then calling get_embed
from get_instance
will do exactly the wrong thing... for any embed not found in the table, it will make a round-trip to the embed provider and create a new record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes, but in practice, it will almost never happen. If you have a reference to an embed in a rich text, then an embed database object was created via AJAX when you added the embed to the rich text. If you have a reference to an embed in a StreamField
, then you most likely used it in front-end, which triggered the creation of an embed object in the database.
So unless you created a script to create “ghost” embed references, all embed references will have a corresponding database object in real life cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fair point! On a related note, I think we could optimise things further by having the Link/EmbedHandler classes implement a method to check "do these tag attributes correspond to one of these model instances", so that we're not uselessly instantiating every object we encounter in rich text (only to either throw it away or match it to an object we already have). But at that point I'm gift-horsing ;-)
params['val'] = (values[0] if len(values) == 1 | ||
else r'(%s)' % r'|'.join(values)) | ||
else: | ||
pattern = r'<(a|embed) %(type)s[^>]*>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see, this case exists to serve a "hidden feature" of find_objects
where passing an empty object list returns all objects of any type referenced within rich text. What's the motivation for this feature? It seems out of place here: it's listing objects rather than finding them; it doesn't match the logical expected behaviour of find_object on an empty list; and it requires quite a bit of special-case code to make it work (and, in particular, conflicts with my earlier assertion that link/embed handlers don't have to be associated with a model)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is indeed an unused feature for now. At first, I started with doing that wildcard search only if searched_objects is None
.
If I remember well, I changed my mind to make a cleaner API using .find_objects(*searched_objects)
instead of .find_objects(searched_objects=None)
. I’m sure I thought “if you search for one or more objects, it searches them, otherwise it searches for every object”. It must have been a brain fart, as I clearly forgot that sometimes we just pass a tuple or list directly without checking whether an item exists or not.
The good news is that it cannot have any impact on the views added, but that’s definitely worth changing. I think that we should just modify line 79 with if searched_objects is None
and reverse the condition, then modify find_objects
, for example by adding a find_all
boolean that could only be True
if there’s no searched_objects
. Or add a _find_objects(searched_objects)
method with all the logic from find_objects
, and create two methods find_objects(*searched_objects)
and find_all_objects()
(that last one passing None
to _find_objects
).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having two separate methods for find_objects
and find_all_objects
definitely makes more sense to me - they're functionally very different things. I think get_pattern_for_objects
and get_handlers
can also be split up in the same way, to simplify the logic.
try: | ||
return model._default_manager.get(id=attrs['id']) | ||
except model.DoesNotExist: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BertrandBordage Any objection to leaving the DoesNotExist
exception uncaught here, to be handled by the caller? I think that makes it a bit more idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean put it in find_objects_in_rich_text
? Yeah, that would be valid, I can’t remember a reason why it would be bad.
Rebased at https://github.com/gasman/wagtail/tree/BertrandBordage-new-uses-ui-rebase with the proposed changes (not re-combining the frontendHTML and editorHTML handlers; adding separate methods for |
Great @gasman :) |
wagtail/core/collectors.py
Outdated
for field in self.fields] | ||
|
||
def prepare_value(self, value): | ||
if isinstance(value, Model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Reposting this comment as I made it against a commit on my own branch rather than this PR, and not sure if you got the notification, @BertrandBordage...)
Do we ever call find_objects
with values that aren't model instances? If so, it looks like we're missing tests for these cases, and we should also consider renaming find_objects
to find_values
. Personally I'd prefer to remove this functionality if there's no immediate need for it, though - it seems like feature creep, and adds complexity that we could do without. (I'm not entirely clear on why "
and |
characters in strings need to be double-escaped, or why we can't replace most of this method with json.dumps
, and if we can avoid going down that rabbit-hole, we should :-) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should probably rename it to find_values
. I wrote this as a framework to search any type of data in StreamFields, so you can for example search for references of an URL that no longer exists and replace them manually. I guess it requires additional tests, like you said.
There are multiple reasons why I didn't use json.dumps
. The main reason was performance: it's faster this way and doesn't need to escape |
in other cases than a string (although I realise now we should escape pk
as well). I know, now that we use the Django collector with it, it's completely negligible.
We need to escape quotes because we want to search for "the \"value\""
instead of "the "value""
in the database cell. We escape the pipe because we search for multiple values at once like this: "a value"|"the other one"|"a Bash command using \|"
.
Yes both escapes should also be done for pks, as stated above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle I'm all in favour of making the collector API broad enough to cover use cases beyond the core functionality as used by the object deletion feature - but as it stands, there's no documentation for this API, or even any samples of calling code that I can infer an API from. This means that as a reviewer it's impossible for me to verify the correctness of the code.
For example, it seems to me that StreamFieldCollector.find_values
has a different concept of what a "value" is: for rich text blocks it's calling find_objects_in_rich_text
, which will only ever return actual model objects, and therefore there's no way of using this method to find URLs within rich text. Is this a bug that directly impacts on the object deletion feature, or a bug in a currently-unused part of the collector API, or a not-yet-implemented part of the API, or something that's completely out of the scope of the API, or does it actually work perfectly thanks to some clever nuance of the code that I've missed? :-) There are numerous questions like this which I can only answer by trying to keep 1000+ lines of code in my head at once...
Ultimately I think the only way I can review the collector code in its current "open-ended" form is for it to be split up further into one commit for each class, with docstrings and tests to define the acceptable kinds of parameters/return values, and for obvious reasons I don't think you want to spend the 2.2 release cycle doing that :-) To get this into 2.1, I think we're going to have to strip it down to the minimal code necessary for the object deletion feature to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so would you be OK with you if that simplification consisted in just removing prepare_value
and just systematically apply json.dumps(v.pk).replace('|', r'\|')
in find_objects
?
Sorry I don’t commit anymore on this branch, I’m not quite sure what to do now that you’ve started your own ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll go ahead and remove prepare_value
and any other code paths that become redundant as a result (I think there are a few if isinstance(value, Model):
switches elsewhere).
I still don't understand why you're manually escaping |
in addition to calling re.escape
. Again I'm struggling with the lack of docstring here - is the return value of prepare_value
intended to be a literal string to match within the json (in which case it's the caller's responsibility to escape that literal string appropriately for use in a regexp, or whatever string-matching mechanism it decides to use), or a regexp pattern (in which case we should call re.replace
on the literal string, since |
is not the only character that has special meaning)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you’re right, the |
escaping must be a mistake given that I added an re.escape
. The only purpose of this method was to prepare the value before injecting it in stream_structure_pattern
, so it had to be escaped so it’s considered as a literal by stream_structure_pattern
.
Very reluctantly bumping this to 2.2 - unfortunately the reviewing work has proceeded slower than I'd hoped, and I don't think I'll be able to complete it without causing excessive delays to the 2.1 release (I'm only in for two days next week). To minimise the impact on @BertrandBordage's Kickstarter work, I'm happy to deal with any code changes that arise from the rest of the review (although, of course, that may entail Bertrand or someone else reviewing my changes...) |
Ouch, OK. No problem for me to review or contribute to the improvements, I still reserved some Wagtail work aside from the Kickstarter work. |
block_types = (ChooserBlock, RichTextBlock) | ||
block_paths_per_collector = [(c, list(c.find_block_type(block_types))) | ||
for c in self.field_collectors] | ||
stream_structure_pattern = r'[[ \t\n:](%s)[] \t\n,}]' % '|'.join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the process of trying and failing to update this regexp for readability, I discovered some "fun" quirks of the mysql regexp engine:
- It doesn't understand backslashed character classes such as
\s
or\n
at all, and interprets them as the letter (s
orn
) instead. (It does understand POSIX character classes such as[:space:]
, but the Python module, which is used by sqlite, doesn't.) - Backslashes inside a
[...]
set are handled literally. This means that[[ \t\n:]
will match\
,t
orn
, but not a tab or line break
This means there's no portable way to match all whitespace characters :-( I was tempted to just leave out \t\n
, since the JSON serializer used by StreamField isn't going to insert those characters, but that's the sort of decision that's bound to come back to bite us. So, I guess I'll go ahead and add an explicit check for mysql.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s right, I also discovered that a few months ago, but forgot that I used it here
So we have to differentiate between the two cases. Do you think you could make a test for this? I’m not sure if that’s doable without manually modifying the serialized JSON content.
yield from self.find_values(sub_value, block_path) | ||
elif isinstance(stream, StreamValue.StreamChild): | ||
if stream.block == current_block: | ||
yield from self.find_values(stream.value, block_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These recursion rules don't seem quite right: unwrapping StreamValue
and StreamChild
happens over two iterations of find_values
, during which two blocks will be popped off block_path
- but both wrapper elements correspond to a StreamBlock
, so logically only one block should have been popped by that point.
(Looking closer, I think there's some inconsistency about whether the type of stream
corresponds to the first block in block_path
, or the block that's just been popped in the previous iteration, and my guess is that this inconsistency only surfaces in nested StreamBlock
s, which aren't covered by the tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure I fully follow you here.
I started building this recursive method using my knowledge of the way StructBlock
, ListBlock
and StreamBlock
work, but it ended up being more complex, with several intermediate classes (that could probably be refactored). I adjusted the method to work in all the scenarii I could think of. Could you make an example that would not be handled by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing test (throwing an 'Unexpected StreamField value' warning) here: gasman@4ed64fa
You can verify that there's a logic error here by asking: does the value of stream
correspond to the first item of block_path
, or not?
- The line
if isinstance(current_block, StreamBlock) and isinstance(stream, StreamValue):
suggests that it does correspond - The
if not block_path:
section at the top suggests that it doesn't correspond (block_path
is empty, but we still have astream
value) elif current_block.name == '' and isinstance(stream, list):
also suggests that it doesn't correspond (the 'nameless' block in a ListBlock definition is the child, not the ListBlock itself)
Unfortunately at this point I really have to ask you to add docstrings to the remainder of collectors.py before I can continue reviewing. Throughout this code there's a lack of precision about the inputs / outputs of each method... up to now I was willing to approach that as a documentation issue, and fill in that missing detail as part of the review process. However, it now seems that this lack of precision is directly causing logical errors such as this one and the regexp escaping bug earlier - at this point, I don't trust my ability to catch these logical errors at the same time as figuring out the intended behaviour of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping this back to @BertrandBordage to deal with https://github.com/wagtail/wagtail/pull/4481/files#r190932257, which looks like it's going to need some non-trivial refactoring :-/
@gasman: Just added some in-depth docstrings, hope it helps :) |
Updated the PR against master. |
Hi @BertrandBordage - thanks for the added docstrings! Unfortunately we've now ended up with two diverging branches - my branch at https://github.com/gasman/wagtail/tree/BertrandBordage-new-uses-ui-rebase has various fixes from the earlier round of reviewing which aren't included in this PR. Also, https://github.com/wagtail/wagtail/pull/4481/files#r190932257 still needs fixing. I'll create a rebased version of my branch with your docstrings merged in, and open a new PR to replace this one. |
>>> [[b.__class__.__name__ for b in path] | ||
... for path in StreamFieldCollector(stream_field) | ||
... .find_block_type((CharBlock, TextBlock))] | ||
[['StreamBlock', 'ListBlock', 'TextBlock']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be [['StreamBlock', 'CharBlock'], ['StreamBlock', 'ListBlock', 'TextBlock']]
?
Replaced by #4702 |
This pull request replaces #3961.
The heart of this pull request is the same, except that it was rebuild from current master with all the rich text editor API changes. It is split in 5 comprehensible commits. To avoid mistakes, I did all the rebasing work manually, by coding again everything except the new collectors.py file.
I took the decision to change a bit the rich text API for two reasons:
expand_db_attributes
for editor HTML & a top-level function for generation front-end HTML. However,Page
link doesn’t follow this pattern, and I found some name mistakes in the tests due to the confusion introduced by this unclear architectureLinkHandler
, especially since we expect people to reference the link between their custom features and the related models, for the collectors introduced in this PR.And of course, it still looks the same as in #3961. It might be improved, but that’s enough for now given that this is already an important enhancement compared to the current state where we have nothing except the usage count, and 500 errors on protected items deletion.