-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve PATIENT/PERSOON processing and more #20
base: main
Are you sure you want to change the base?
Conversation
It's like titlecasing but it touches the word only if it was originally uppercase. There might be better names for it...
...as required by pylint.
This is needed so as to reduce the number of arguments for the `_match_sequence` method and creates a cleaner inheritance hierarchy between annotators, too.
Hi @mkorvas, I finally found time to check this PR, sorry it took this long. I'm just too busy with other things right now. I really like some of the things you did, especially with the Sequence matching. Thanks a lot for that. I also added some comments, please review them whenever you have the time. Additionally:
|
Hi @vmenger, thanks for getting back to this PR! In your message from "5 days ago", you wrote about having added some comments but I am not finding them. Where should I look? |
On the PR page, where I'm assuming you are currently already reading this? :-) Should see the comments when you scroll up (about 30 of them) #20 |
docdeid/document.py
Outdated
return self._tokenizers | ||
|
||
@property | ||
def token_lists(self) -> Mapping[str, TokenList]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just do this?
@property
def token_lists(self) -> dict[str, TokenList]:
return {tokenizer: self.get_tokens(tokenizer) for tokenizer in self.tokenizers}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, do we really need this method if it's a oneliner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this property/method is not really very useful. I am replacing it with .tokenizers
and .get_tokens
at the few places where it's currently used.
(If the comment gets submitted immediately -- sorry for the long hiatus! I finally seem to have the time to handle your detailed review, now that some deadlines of last year's December have passed.)
@@ -74,7 +77,7 @@ def __init__( | |||
) -> None: | |||
|
|||
self._text = text | |||
self._tokenizers = tokenizers | |||
self._tokenizers = None if tokenizers is None else frozendict(tokenizers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to make mypy happy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's to make sure that any additions to the tokenizers
dict made after it was passed to this Document.__init__
method are not going to affect the tokenizers used by this Document
instance. I don't see immediately whether this safety measure is necessary but it looks more certainly more correct this way.
(The same argument would apply to metadata
-- I find it ugly that the object, the dictionary passed to the Document
initializer here can be modified later and the Document
instance's metadata
field will reflect the modifications. But I didn't have the need to fix that. I would prefer to just use a simple dict instead of the MetaData
class in fact.)
docdeid/annotation.py
Outdated
@@ -126,6 +127,13 @@ class AnnotationSet(set[Annotation]): | |||
It extends the builtin ``set``. | |||
""" | |||
|
|||
def __init__(self, *args, **kwargs) -> None: | |||
super().__init__(*args, **kwargs) | |||
# Ugh, this feels like Java 9. (For sake of Mypy:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah Mypy was more of an experiment for me. It did catch some bugs before releasing so that was nice, but I think its too much hassle to use in future projects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing your view! This was my first experience with using Mypy and it does have benefits... but sometimes is annoying. I should perhaps disobey its requirements when the benefit of implementing them is outweighted by the effort in doing so or by the lost code readability.
docdeid/annotation.py
Outdated
# docstring of `typing.Optional` says it's equivalent to | ||
# `typing.Union[None, _]`: | ||
# if not isinstance(callbacks, Optional[frozendict]): | ||
if not isinstance(callbacks, frozendict | None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately using | is only supported from Python 3.10, and since we're still supporting 3.9 I think for now we should use Union only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! In my development setup, I cheat by using a newer version of Python. It didn't even occur to me this could be an issue for Python backward compatibility.
Now, though, consulting the Python 3.9 docs, it seems that Optional
would be an option, too, and a more concise one. Let me thus rewrite the code to that.
docdeid/annotation.py
Outdated
@@ -179,3 +193,47 @@ def has_overlap(self) -> bool: | |||
return True | |||
|
|||
return False | |||
|
|||
import docdeid # needed to type-annotate the `doc` argument below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put imports at the top of modules only please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... how to do that without introducing a circular import in this case? We could:
- Give up the typing hint to avoid the need for the import altogether.
- Remove the imports now done in
docdeid/__init__.py
-- that would be backward-incompatible and overkill considering the reason why we'd do that. - Find another way of declaring the stringized type annotation, one which doesn't require
docdeid
to be a known module when evaluating the typing annotation.
I like option 3 the best. I guess I tried already when I wrote the code the first time to use another typing hint than the fully-qualified stringized "docdeid.document.Document"
and it probably didn't work. Before I try again, let me explore the option I noticed when I studied the typing docs a short while ago, towards the end of the page -- guarding the import with the condition of typing.TYPE_CHECKING
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now getting this when trying to verify my solution. I will really have to find a way to revert to running Python 3.9...
|
||
num_matched = 0 | ||
end_token = start_token | ||
for tok_pattern, end_token in zip(tok_patterns, tokens): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we don't really need the count, we can also iterate using all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok nvm, we need the start and end token
return not cls._lookup(value, **kwargs) | ||
if func == "tag": | ||
annos = kwargs.get("annos", ()) | ||
return any(anno.tag == value for anno in annos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now I finally get why we needed this functionality :)
meta_val = getattr(kwargs["metadata"][meta_key], meta_attr) | ||
except (TypeError, KeyError, AttributeError): | ||
return False | ||
return token == meta_val if isinstance(meta_val, str) else token in meta_val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what this does, but I guess we'll find out in the Deduce PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be used to detect a sequence of given names of a patient that come from the document metadata, especially when the names are uncommon and so static name dictionaries wouldn't work so well. The config file would then declare a token matcher like this:
{
"lookup": "patient.naam"
}
@@ -1,21 +1,65 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What specifically do we need this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already forgot. But as the docs put it:
The only feature that requires using the future statement is annotations (see PEP 563).
Commenting it out results in this error in particular (caused by a forward reference to a type:
docdeid/process/annotator.py:44: in NestedTokenPattern
pattern: list[TokenPatternFromCfg]
E NameError: name 'TokenPatternFromCfg' is not defined
docdeid/process/annotator.py
Outdated
@@ -102,85 +209,46 @@ def annotate(self, doc: Document) -> list[Annotation]: | |||
|
|||
class MultiTokenLookupAnnotator(Annotator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know Deduce is the main user of this library, but I don't think we should just remove functionality and hope no one was using it, lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! I see I reimplemented that logic in Deduce but forgot there might be other users of Docdeid. I am adding it back here.
@mkorvas Ah they were still in limbo indeed, submitted them just now! |
This syntax is not supported in Python 3.9.
I believe I have addressed all your outstanding comments, the PR should be ready for another round of review. |
This is the result of my first encounter with this codebase (Docdeid and Deduce), the first part (Docdeid). My goal was to understand the inner workings of it and then make sure that capitalized street names are pseudonymized (all-caps or titlecased, and covering also the special case of the "IJ" digraph in Dutch). When at it, I noticed unexpected behaviour for patient names v. other person names and improved that as well.