Improve PATIENT/PERSOON processing and more #20

Open · mkorvas wants to merge 41 commits into base: main

Conversation

mkorvas commented Jul 12, 2024

This is the first part (Docdeid) of the result of my first encounter with this codebase (Docdeid and Deduce). My goal was to understand its inner workings and then make sure that capitalized street names are pseudonymized (all-caps or titlecased, also covering the special case of the Dutch "IJ" digraph). While at it, I noticed unexpected behaviour for patient names vs. other person names and improved that as well.
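
A minimal sketch of the capitalization rule for such street names (the function name and the exact IJ handling here are illustrative, not necessarily what the code does):

def titlecase_if_upper(word: str) -> str:
    """Titlecase `word`, but only if it was originally all-uppercase.

    Handles the Dutch "IJ" digraph, which titlecases as "IJ", not "Ij":
    "IJSSELSTRAAT" -> "IJsselstraat".
    """
    if not word.isupper():
        return word  # leave mixed-case and lowercase words untouched
    if word.startswith("IJ"):
        return "IJ" + word[2:].lower()
    return word[:1] + word[1:].lower()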

It's like titlecasing but it touches the word only if it was
originally uppercase. There might be better names for it...
...as required by pylint.
This is needed to reduce the number of arguments for the `_match_sequence` method, and it creates a cleaner inheritance hierarchy between annotators, too.
vmenger (Owner) commented Oct 23, 2024

Hi @mkorvas, I finally found time to check this PR, sorry it took this long. I'm just too busy with other things right now.

I really like some of the things you did, especially with the Sequence matching. Thanks a lot for that. I also added some comments, please review them whenever you have the time.

Additionally:

  • Please ensure all changes are listed in the changelog
  • Please check all changes are incorporated in the docs
  • I didn't check the tests yet, will do those later
  • I just added you as a collaborator to deduce and docdeid, which makes it a bit easier to work on this
  • I think docdeid.process.annotator has gotten a bit too big; we should maybe split some of the Pattern stuff into a separate module, and also deprecate the old token-pattern matching logic. We can also fix this later

mkorvas (Author) commented Oct 28, 2024

Hi @vmenger, thanks for getting back to this PR!

In your message from "5 days ago", you wrote about having added some comments but I am not finding them. Where should I look?

vmenger (Owner) commented Oct 28, 2024

> Hi @vmenger, thanks for getting back to this PR!
>
> In your message from "5 days ago", you wrote about having added some comments but I am not finding them. Where should I look?

On the PR page, where I'm assuming you are currently already reading this? :-) You should see the comments when you scroll up (about 30 of them): #20

mkorvas (Author) commented Oct 28, 2024

Either I am blind or it's not there. This is what I see:
[screenshot of the PR page]

Could it be that GitHub has a Reviews feature, where batches of comments can be submitted all at once, like GitLab and Bitbucket also have? That has confused me a couple of times in the past, making me believe I had submitted my review comments when they were still in private-draft status.

return self._tokenizers

@property
def token_lists(self) -> Mapping[str, TokenList]:
vmenger (Owner):

Can't we just do this?

@property
def token_lists(self) -> dict[str, TokenList]:
    return {tokenizer: self.get_tokens(tokenizer) for tokenizer in self.tokenizers}

vmenger (Owner):

Additionally, do we really need this method if it's a one-liner?

mkorvas (Author):

You are right: this property/method is not really very useful. I am replacing it with .tokenizers and .get_tokens at the few places where it's currently used.

(In case this comment gets posted immediately: sorry for the long hiatus! I finally seem to have the time to handle your detailed review, now that some deadlines from last December have passed.)

@@ -74,7 +77,7 @@ def __init__(
) -> None:

self._text = text
self._tokenizers = tokenizers
self._tokenizers = None if tokenizers is None else frozendict(tokenizers)
vmenger (Owner):

Is this to make mypy happy?

mkorvas (Author):

I guess it's to make sure that any additions to the tokenizers dict made after it was passed to this Document.__init__ method do not affect the tokenizers used by this Document instance. I don't immediately see whether this safety measure is necessary, but it certainly looks more correct this way.

(The same argument would apply to metadata: I find it ugly that the dictionary passed to the Document initializer here can be modified later, with the Document instance's metadata field reflecting the modifications. But I didn't have the need to fix that. In fact, I would prefer to just use a simple dict instead of the MetaData class.)
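
To illustrate the aliasing hazard the frozendict snapshot guards against, here is a self-contained sketch (Holder stands in for Document; the names are illustrative):

from frozendict import frozendict

class Holder:
    def __init__(self, tokenizers):
        # Take an immutable snapshot: later changes to the caller's
        # dict are not reflected here.
        self._tokenizers = None if tokenizers is None else frozendict(tokenizers)

tokenizers = {"default": object()}
holder = Holder(tokenizers)
tokenizers["extra"] = object()  # mutate the caller's dict afterwards
assert "extra" not in holder._tokenizers  # the snapshot is unaffected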

@@ -126,6 +127,13 @@ class AnnotationSet(set[Annotation]):
It extends the builtin ``set``.
"""

def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
# Ugh, this feels like Java 9. (For sake of Mypy:)
vmenger (Owner):

Yeah, Mypy was more of an experiment for me. It did catch some bugs before releasing, so that was nice, but I think it's too much hassle to use in future projects.

mkorvas (Author):

Thanks for sharing your view! This was my first experience with using Mypy, and it does have benefits... but it is sometimes annoying. I should perhaps disobey its requirements when the benefit of satisfying them is outweighed by the effort of doing so or by the lost code readability.

# docstring of `typing.Optional` says it's equivalent to
# `typing.Union[None, _]`:
# if not isinstance(callbacks, Optional[frozendict]):
if not isinstance(callbacks, frozendict | None):
vmenger (Owner):

Unfortunately, using | is only supported from Python 3.10, and since we're still supporting 3.9, I think for now we should use Union only.

mkorvas (Author):

Good catch! In my development setup, I cheat by using a newer version of Python, so it didn't even occur to me this could be an issue for backward compatibility.

Now, though, consulting the Python 3.9 docs, it seems that Optional would be an option, too, and a more concise one. Let me thus rewrite the code to that.
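
For reference, a minimal sketch of a check that works on 3.9. Note that isinstance rejects subscripted typing constructs such as Optional[frozendict] at runtime, so a plain tuple of runtime types is the safe spelling (the function name is illustrative):

from frozendict import frozendict

def validate_callbacks(callbacks):
    # type(None) plays the role of the Optional's None branch.
    if not isinstance(callbacks, (frozendict, type(None))):
        raise TypeError("callbacks must be a frozendict or None")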

@@ -179,3 +193,47 @@ def has_overlap(self) -> bool:
return True

return False

import docdeid # needed to type-annotate the `doc` argument below
vmenger (Owner):

Can you put imports at the top of modules only please?

mkorvas (Author):

Hm... how to do that without introducing a circular import in this case? We could:

  1. Give up the typing hint to avoid the need for the import altogether.
  2. Remove the imports now done in docdeid/__init__.py -- that would be backward-incompatible and overkill considering the reason why we'd do that.
  3. Find another way of declaring the stringized type annotation, one which doesn't require docdeid to be a known module when evaluating the typing annotation.

I like option 3 the best. I think I already tried, when I first wrote the code, to use a typing hint other than the fully qualified, stringized "docdeid.document.Document", and it probably didn't work. Before I try again, let me explore the option I noticed towards the end of the typing docs a short while ago: guarding the import with typing.TYPE_CHECKING (see the sketch below).
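
A minimal sketch of that pattern (the module and function names are illustrative): the guarded import is evaluated only by static type checkers, never at runtime, so the circular import disappears:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers such as mypy; skipped at runtime.
    from docdeid.document import Document

def annotate(doc: Document) -> None:
    ...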

mkorvas (Author):

Now getting this when trying to verify my solution. I will really have to find a way to revert to running Python 3.9...


num_matched = 0
end_token = start_token
for tok_pattern, end_token in zip(tok_patterns, tokens):
vmenger (Owner):

As we don't really need the count, we can also iterate using all?

vmenger (Owner):

Ok nvm, we need the start and end token

return not cls._lookup(value, **kwargs)
if func == "tag":
annos = kwargs.get("annos", ())
return any(anno.tag == value for anno in annos)
vmenger (Owner):

Ok, now I finally get why we needed this functionality :)

meta_val = getattr(kwargs["metadata"][meta_key], meta_attr)
except (TypeError, KeyError, AttributeError):
return False
return token == meta_val if isinstance(meta_val, str) else token in meta_val
vmenger (Owner):

Curious what this does, but I guess we'll find out in the Deduce PR

mkorvas (Author):

This can be used to detect a sequence of given names of a patient that come from the document metadata, especially when the names are uncommon and so static name dictionaries wouldn't work so well. The config file would then declare a token matcher like this:

{
    "lookup": "patient.naam"
}
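
A self-contained sketch of how such a lookup resolves (mirroring the except clause shown above; Person and matches_metadata are illustrative names, not the actual API):

from dataclasses import dataclass

@dataclass
class Person:
    naam: tuple

def matches_metadata(token: str, lookup: str, metadata: dict) -> bool:
    meta_key, _, meta_attr = lookup.partition(".")
    try:
        meta_val = getattr(metadata[meta_key], meta_attr)
    except (TypeError, KeyError, AttributeError):
        return False
    # A single string must match exactly; a collection matches by membership.
    return token == meta_val if isinstance(meta_val, str) else token in meta_val

metadata = {"patient": Person(naam=("Jan", "Pieter"))}
assert matches_metadata("Jan", "patient.naam", metadata)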

@@ -1,21 +1,65 @@
from __future__ import annotations
vmenger (Owner):

What specifically do we need this for?

mkorvas (Author):

I already forgot. But as the docs put it:

> The only feature that requires using the future statement is annotations (see PEP 563).

Commenting it out results in this error in particular (caused by a forward reference to a type):

docdeid/process/annotator.py:44: in NestedTokenPattern
    pattern: list[TokenPatternFromCfg]
E   NameError: name 'TokenPatternFromCfg' is not defined
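
A minimal reproduction of that failure mode (the class names are taken from the error above; the real definitions differ): without the future import, class-level annotations are evaluated eagerly at class-creation time, so the forward reference raises NameError; with it, annotations stay unevaluated strings.

from __future__ import annotations

class NestedTokenPattern:
    # Forward reference: TokenPatternFromCfg is defined only further down.
    # Without the future import above, this annotation would be evaluated
    # eagerly at class-creation time, raising:
    # NameError: name 'TokenPatternFromCfg' is not defined
    pattern: list[TokenPatternFromCfg]

class TokenPatternFromCfg:
    ...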

@@ -102,85 +209,46 @@ def annotate(self, doc: Document) -> list[Annotation]:

class MultiTokenLookupAnnotator(Annotator):
vmenger (Owner):

I know Deduce is the main user of this library, but I don't think we should just remove functionality and hope no one was using it, lol

mkorvas (Author):

You are right! I see I reimplemented that logic in Deduce but forgot there might be other users of Docdeid. I am adding it back here.

vmenger (Owner) commented Oct 28, 2024

@mkorvas Ah they were still in limbo indeed, submitted them just now!

mkorvas (Author) commented Jan 9, 2025

I believe I have addressed all your outstanding comments; the PR should be ready for another round of review.

mkorvas requested a review from vmenger on January 9, 2025, 20:42