Improve PATIENT/PERSOON processing and more #20

Open · mkorvas wants to merge 41 commits into base: main

Conversation

mkorvas commented Jul 12, 2024

This is the first part (Docdeid) of the result of my first encounter with this codebase (Docdeid and Deduce). My goal was to understand its inner workings and then make sure that capitalized street names are pseudonymized (all-caps or titlecased, also covering the special case of the Dutch "IJ" digraph). While at it, I noticed unexpected behaviour for patient names vs. other person names and improved that as well.
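
A minimal sketch of the capitalization rule for such street names (the function name and the exact IJ handling here are illustrative, not necessarily what the code does):

def titlecase_if_upper(word: str) -> str:
    """Titlecase `word`, but only if it was originally all-uppercase.

    Handles the Dutch "IJ" digraph, which titlecases as "IJ", not "Ij":
    "IJSSELSTRAAT" -> "IJsselstraat".
    """
    if not word.isupper():
        return word  # leave mixed-case and lowercase words untouched
    if word.startswith("IJ"):
        return "IJ" + word[2:].lower()
    return word[:1] + word[1:].lower()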

It's like titlecasing but it touches the word only if it was
originally uppercase. There might be better names for it...
...as required by pylint.
This is needed to reduce the number of arguments for the `_match_sequence` method, and it creates a cleaner inheritance hierarchy between annotators, too.
vmenger (Owner) commented Oct 23, 2024

Hi @mkorvas, I finally found time to check this PR, sorry it took this long. I'm just too busy with other things right now.

I really like some of the things you did, especially with the Sequence matching. Thanks a lot for that. I also added some comments, please review them whenever you have the time.

Additionally:

  • Please ensure all changes are listed in the changelog
  • Please check all changes are incorporated in the docs
  • I didn't check the tests yet, will do those later
  • I just added you as a collaborator to deduce and docdeid, which makes it a bit easier to work on this
  • I think docdeid.process.annotator has gotten a bit too big; we should maybe split some of the Pattern stuff into a separate module, and also deprecate the old token-pattern matching logic. We can also fix this later

mkorvas (Author) commented Oct 28, 2024

Hi @vmenger, thanks for getting back to this PR!

In your message from "5 days ago", you wrote about having added some comments but I am not finding them. Where should I look?

vmenger (Owner) commented Oct 28, 2024

> Hi @vmenger, thanks for getting back to this PR!
>
> In your message from "5 days ago", you wrote about having added some comments but I am not finding them. Where should I look?

On the PR page, where I'm assuming you are currently already reading this? :-) You should see the comments when you scroll up (about 30 of them): #20

mkorvas (Author) commented Oct 28, 2024

Either I am blind or it's not there. This is what I see:
[screenshot of the PR page]

Could it be that GitHub has a Reviews feature, where batches of comments can be submitted all at once, like GitLab and Bitbucket also have? That has confused me a couple of times in the past, making me believe I had submitted my review comments when they were still in private-draft status.

return self._tokenizers

@property
def token_lists(self) -> Mapping[str, TokenList]:
vmenger (Owner):

Can't we just do this?

@property
def token_lists(self) -> dict[str, TokenList]:
    return {tokenizer: self.get_tokens(tokenizer) for tokenizer in self.tokenizers}

vmenger (Owner):

Additionally, do we really need this method if it's a one-liner?

mkorvas (Author):

You are right: this property/method is not really very useful. I am replacing it with .tokenizers and .get_tokens at the few places where it's currently used.

(In case this comment gets posted immediately: sorry for the long hiatus! I finally seem to have the time to handle your detailed review, now that some deadlines from last December have passed.)

@@ -74,7 +77,7 @@ def __init__(
) -> None:

self._text = text
self._tokenizers = tokenizers
self._tokenizers = None if tokenizers is None else frozendict(tokenizers)
vmenger (Owner):

Is this to make mypy happy?

mkorvas (Author):

I guess it's to make sure that any additions to the tokenizers dict made after it was passed to this Document.__init__ method do not affect the tokenizers used by this Document instance. I don't immediately see whether this safety measure is necessary, but it certainly looks more correct this way.

(The same argument would apply to metadata: I find it ugly that the dictionary passed to the Document initializer here can be modified later, with the Document instance's metadata field reflecting the modifications. But I didn't have the need to fix that. In fact, I would prefer to just use a simple dict instead of the MetaData class.)
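
To illustrate the aliasing hazard the frozendict snapshot guards against, here is a self-contained sketch (Holder stands in for Document; the names are illustrative):

from frozendict import frozendict

class Holder:
    def __init__(self, tokenizers):
        # Take an immutable snapshot: later changes to the caller's
        # dict are not reflected here.
        self._tokenizers = None if tokenizers is None else frozendict(tokenizers)

tokenizers = {"default": object()}
holder = Holder(tokenizers)
tokenizers["extra"] = object()  # mutate the caller's dict afterwards
assert "extra" not in holder._tokenizers  # the snapshot is unaffected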

@@ -126,6 +127,13 @@ class AnnotationSet(set[Annotation]):
It extends the builtin ``set``.
"""

def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
# Ugh, this feels like Java 9. (For sake of Mypy:)
vmenger (Owner):

Yeah, Mypy was more of an experiment for me. It did catch some bugs before releasing, so that was nice, but I think it's too much hassle to use in future projects.

mkorvas (Author):

Thanks for sharing your view! This was my first experience with using Mypy, and it does have benefits... but it is sometimes annoying. I should perhaps disobey its requirements when the benefit of satisfying them is outweighed by the effort of doing so or by the lost code readability.

# docstring of `typing.Optional` says it's equivalent to
# `typing.Union[None, _]`:
# if not isinstance(callbacks, Optional[frozendict]):
if not isinstance(callbacks, frozendict | None):
vmenger (Owner):

Unfortunately, using | is only supported from Python 3.10, and since we're still supporting 3.9, I think for now we should use Union only.

mkorvas (Author):

Good catch! In my development setup, I cheat by using a newer version of Python, so it didn't even occur to me this could be an issue for backward compatibility.

Now, though, consulting the Python 3.9 docs, it seems that Optional would be an option, too, and a more concise one. Let me thus rewrite the code to that.
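
For reference, a minimal sketch of a check that works on 3.9. Note that isinstance rejects subscripted typing constructs such as Optional[frozendict] at runtime, so a plain tuple of runtime types is the safe spelling (the function name is illustrative):

from frozendict import frozendict

def validate_callbacks(callbacks):
    # type(None) plays the role of the Optional's None branch.
    if not isinstance(callbacks, (frozendict, type(None))):
        raise TypeError("callbacks must be a frozendict or None")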

@@ -179,3 +193,47 @@ def has_overlap(self) -> bool:
return True

return False

import docdeid # needed to type-annotate the `doc` argument below
vmenger (Owner):

Can you put imports at the top of modules only please?

mkorvas (Author):

Hm... how to do that without introducing a circular import in this case? We could:

  1. Give up the typing hint to avoid the need for the import altogether.
  2. Remove the imports now done in docdeid/__init__.py -- that would be backward-incompatible and overkill considering the reason why we'd do that.
  3. Find another way of declaring the stringized type annotation, one which doesn't require docdeid to be a known module when evaluating the typing annotation.

I like option 3 the best. I think I already tried, when I first wrote the code, to use a typing hint other than the fully qualified, stringized "docdeid.document.Document", and it probably didn't work. Before I try again, let me explore the option I noticed towards the end of the typing docs a short while ago: guarding the import with typing.TYPE_CHECKING (see the sketch below).
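
A minimal sketch of that pattern (the module and function names are illustrative): the guarded import is evaluated only by static type checkers, never at runtime, so the circular import disappears:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers such as mypy; skipped at runtime.
    from docdeid.document import Document

def annotate(doc: Document) -> None:
    ...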

mkorvas (Author):

Now getting this when trying to verify my solution. I will really have to find a way to revert to running Python 3.9...


num_matched = 0
end_token = start_token
for tok_pattern, end_token in zip(tok_patterns, tokens):
vmenger (Owner):

As we don't really need the count, we can also iterate using all?

vmenger (Owner):

Ok nvm, we need the start and end token

return not cls._lookup(value, **kwargs)
if func == "tag":
annos = kwargs.get("annos", ())
return any(anno.tag == value for anno in annos)
vmenger (Owner):

Ok, now I finally get why we needed this functionality :)

meta_val = getattr(kwargs["metadata"][meta_key], meta_attr)
except (TypeError, KeyError, AttributeError):
return False
return token == meta_val if isinstance(meta_val, str) else token in meta_val
vmenger (Owner):

Curious what this does, but I guess we'll find out in the Deduce PR

mkorvas (Author):

This can be used to detect a sequence of given names of a patient that come from the document metadata, especially when the names are uncommon and so static name dictionaries wouldn't work so well. The config file would then declare a token matcher like this:

{
    "lookup": "patient.naam"
}
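
A self-contained sketch of how such a lookup resolves (mirroring the except clause shown above; Person and matches_metadata are illustrative names, not the actual API):

from dataclasses import dataclass

@dataclass
class Person:
    naam: tuple

def matches_metadata(token: str, lookup: str, metadata: dict) -> bool:
    meta_key, _, meta_attr = lookup.partition(".")
    try:
        meta_val = getattr(metadata[meta_key], meta_attr)
    except (TypeError, KeyError, AttributeError):
        return False
    # A single string must match exactly; a collection matches by membership.
    return token == meta_val if isinstance(meta_val, str) else token in meta_val

metadata = {"patient": Person(naam=("Jan", "Pieter"))}
assert matches_metadata("Jan", "patient.naam", metadata)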

@@ -1,21 +1,65 @@
from __future__ import annotations
vmenger (Owner):

What specifically do we need this for?

mkorvas (Author):

I already forgot. But as the docs put it:

> The only feature that requires using the future statement is annotations (see PEP 563).

Commenting it out results in this error in particular (caused by a forward reference to a type):

docdeid/process/annotator.py:44: in NestedTokenPattern
    pattern: list[TokenPatternFromCfg]
E   NameError: name 'TokenPatternFromCfg' is not defined
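
A minimal reproduction of that failure mode (the class names are taken from the error above; the real definitions differ): without the future import, class-level annotations are evaluated eagerly at class-creation time, so the forward reference raises NameError; with it, annotations stay unevaluated strings.

from __future__ import annotations

class NestedTokenPattern:
    # Forward reference: TokenPatternFromCfg is defined only further down.
    # Without the future import above, this annotation would be evaluated
    # eagerly at class-creation time, raising:
    # NameError: name 'TokenPatternFromCfg' is not defined
    pattern: list[TokenPatternFromCfg]

class TokenPatternFromCfg:
    ...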

@@ -102,85 +209,46 @@ def annotate(self, doc: Document) -> list[Annotation]:

class MultiTokenLookupAnnotator(Annotator):
vmenger (Owner):

I know Deduce is the main user of this library, but I don't think we should just remove functionality and hope no one was using it, lol

mkorvas (Author):

You are right! I see I reimplemented that logic in Deduce but forgot there might be other users of Docdeid. I am adding it back here.

vmenger (Owner) commented Oct 28, 2024

@mkorvas Ah they were still in limbo indeed, submitted them just now!

mkorvas (Author) commented Jan 9, 2025

I believe I have addressed all your outstanding comments; the PR should be ready for another round of review.

mkorvas requested a review from vmenger on January 9, 2025, 20:42