
Improve PATIENT/PERSOON processing and more #20

Status: Open. This pull request wants to merge 41 commits into main; the view below shows changes from 1 commit.

Commits (41):
All commits are by matej-ibis-ai.

30ce936  Make MultiTok..Annotator notice changes in the trie (Mar 1, 2024)
e12d0d0  Provide `LowercaseTail` string modifier (Mar 1, 2024)
82aab5f  Enable specifying the lang for titlecasing (Mar 4, 2024)
156e201  Minimize data fixtures for tests (Mar 4, 2024)
c7b4c89  Log annotated text after every processor (Mar 6, 2024)
4459c14  Update documentation slightly (Mar 6, 2024)
810b8b3  Expose `Document.token_lists` as a property (Mar 6, 2024)
5002696  (Almost) automatically format code (Mar 7, 2024)
7d2d866  Simplify `MultiTokenLookupAnnotator`... (Mar 7, 2024)
762866a  Update the `MultiTok...Annotator` docstring (Mar 8, 2024)
1ae6846  Test user additions to the lookup trie (Mar 8, 2024)
ae1f93e  Test the `tokenizers` and `token_lists` props (Mar 8, 2024)
d415f51  Remove and ignore the IDEA project file (Mar 8, 2024)
d8e8ed3  Annotate docs for logging only if level is DEBUG (Mar 8, 2024)
03fc99d  Cosmetics (Mar 8, 2024)
5d188cd  Support whitespace trimming in `WordBoundaryTokenizer` (Mar 11, 2024)
6ea9b74  Move `SequenceTokenizer` to Docdeid (Mar 11, 2024)
4110a53  Format code (Mar 11, 2024)
df73e54  Replace `_DIRECTION_MAP` with an enum (Mar 11, 2024)
99163d6  Improve and test `annos_by_token()` (Mar 11, 2024)
c7ba5bc  Drop `Token.get_nth`, simplify `Token.iter_to` (Mar 12, 2024)
c80e2ad  Format code (Mar 12, 2024)
40fcd62  Test and fix `Direction` (Mar 12, 2024)
15b8648  Fix Flake8-reported errors (Mar 12, 2024)
ebdefa4  Address most non-Mypy lint issues (Mar 12, 2024)
4a082b8  Address easy and valid Mypy issues (Mar 12, 2024)
3319df1  Add a test for keep_blanks=False in WBTokenizer (Jul 12, 2024)
1afb16f  Document how to run tests better + cosmetics (Jul 12, 2024)
53db956  Drop the `Document.token_lists` property (Jan 7, 2025)
230c507  Avoid "|" for union types (Jan 8, 2025)
25cbcfd  Move `annos_by_token` to `Document` (Jan 8, 2025)
36eb1e3  Simplify `Direction.from_string` (Jan 8, 2025)
573deff  Rename `SequenceAnnotator.dicts` to `ds` (Jan 8, 2025)
a2704c5  Replace `list(map(f, xs))` with list comprehension (Jan 8, 2025)
3ca37aa  Re-add `MultiTokenLookupAnnotator` accepting a `LookupSet` (Jan 8, 2025)
68f4afb  Add a test for matching multi-word phrases (Jan 9, 2025)
fb3cbd8  Try to support multi-word matching in SequenceAnnotator (Jan 9, 2025)
0c04a78  Give up multi-word matching in SequenceAnnotator (Jan 9, 2025)
82c52fc  Move seq pattern validation to a new method (Jan 9, 2025)
9dcc4f0  Polish the code a little (Jan 9, 2025)
659a694  Don't fail validation on refs to metadata (Jan 10, 2025)
Commit c7b4c896f1ed3718a118d70cae5156d7b481b1ff: Log annotated text after every processor (committed by matej-ibis-ai, Mar 6, 2024)
4 changes: 4 additions & 0 deletions docdeid/process/doc_processor.py
@@ -1,8 +1,10 @@
import logging
from abc import ABC, abstractmethod
from collections import OrderedDict
from typing import Iterator, Optional, Union

from docdeid.document import Document
from docdeid.utils import annotate_doc


class DocProcessor(ABC): # pylint: disable=R0903
@@ -143,6 +145,8 @@ def process(
            elif isinstance(proc, DocProcessorGroup):
                proc.process(doc, enabled=enabled, disabled=disabled)

            logging.debug("after %s: %s", name, annotate_doc(doc))

    def __iter__(self) -> Iterator:

        return iter(self._processors.items())
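Note that this `logging.debug` call builds the annotated text on every processor run, even when debug logging is disabled; a later commit in this PR ("Annotate docs for logging only if level is DEBUG") addresses exactly that. A minimal sketch of the guard pattern, using a hypothetical `render` callable in place of `annotate_doc`:

```python
import logging

logger = logging.getLogger("docdeid.sketch")


def log_after_processor(name: str, render) -> None:
    """Log the annotated text after a processor ran, building it only
    when DEBUG logging is actually enabled (rendering can be expensive)."""
    if logger.isEnabledFor(logging.DEBUG):
        # %-style formatting defers string interpolation, but the render()
        # argument itself would be evaluated eagerly without this guard.
        logger.debug("after %s: %s", name, render())
```

The `isEnabledFor` check is the standard idiom from the `logging` documentation for skipping expensive log-argument computation.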
31 changes: 31 additions & 0 deletions docdeid/utils.py
@@ -1,3 +1,5 @@
from collections import defaultdict

from frozendict import frozendict

from docdeid.document import Document
@@ -32,3 +34,32 @@ def annotate_intext(doc: Document) -> str:
    )

    return text


def annotate_doc(doc: Document) -> str:
    """\
    Adds XML-like markup for annotations into the text of a document.

    Also handles nested mentions and, to some extent, overlapping mentions,
    even though this kind of markup cannot really represent them.
    """
    annos_from_shortest = sorted(
        doc.annotations,
        key=lambda anno: anno.end_char - anno.start_char)
    idx_to_anno_starts = defaultdict(list)
    idx_to_anno_ends = defaultdict(list)
    for anno in annos_from_shortest:
        idx_to_anno_starts[anno.start_char].append(anno)
        idx_to_anno_ends[anno.end_char].append(anno)
    markup_indices = sorted(set(idx_to_anno_starts).union(idx_to_anno_ends))
    chunks = []
    last_idx = 0
    for idx in markup_indices:
        chunks.append(doc.text[last_idx:idx])
        for ending_anno in idx_to_anno_ends[idx]:
            chunks.append(f'</{ending_anno.tag.upper()}>')
        for starting_anno in reversed(idx_to_anno_starts[idx]):
            chunks.append(f'<{starting_anno.tag.upper()}>')
        last_idx = idx
    chunks.append(doc.text[last_idx:])
    return ''.join(chunks)

Review comment (Owner), on `def annotate_doc`:

    OK, so this is like the above function `annotate_intext`, but a bit more elaborate? Should we just merge them? In any case, this needs a more descriptive name.

Reply (Author):

    Yeah, I think that's the case. Let me replace the body of `annotate_intext` with a call to `annotate_doc`. Or would you rather deprecate the former method explicitly (using the `deprecation` package, perhaps), or even simply remove it?

Review comment (Owner), on the `sorted(...)` call:

    You can do:

    annos_from_shortest = doc.annotations.sorted(by=("length",))
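Stripped of docdeid's `Document` type, the markup algorithm above can be sketched self-contained. `Anno` here is a hypothetical stand-in for docdeid's annotation class, assumed only to carry `start_char`, `end_char`, and `tag`:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Anno:
    # Hypothetical stand-in for docdeid's annotation class.
    start_char: int
    end_char: int
    tag: str


def annotate_text(text: str, annotations) -> str:
    """Insert XML-like tags for (possibly nested) annotations into text."""
    # Sorting shortest-first means that at a shared boundary index, inner
    # (nested) mentions open last and close first, so the tags nest properly.
    annos_from_shortest = sorted(
        annotations, key=lambda a: a.end_char - a.start_char)
    idx_to_starts = defaultdict(list)
    idx_to_ends = defaultdict(list)
    for anno in annos_from_shortest:
        idx_to_starts[anno.start_char].append(anno)
        idx_to_ends[anno.end_char].append(anno)
    chunks, last_idx = [], 0
    # Walk every index where some tag opens or closes, emitting the plain
    # text since the previous boundary, then closing tags, then opening tags.
    for idx in sorted(set(idx_to_starts) | set(idx_to_ends)):
        chunks.append(text[last_idx:idx])
        for ending in idx_to_ends[idx]:
            chunks.append(f"</{ending.tag.upper()}>")
        for starting in reversed(idx_to_starts[idx]):
            chunks.append(f"<{starting.tag.upper()}>")
        last_idx = idx
    chunks.append(text[last_idx:])
    return "".join(chunks)


annos = [Anno(0, 10, "persoon"), Anno(0, 3, "voornaam")]
print(annotate_text("Jan Jansen was seen", annos))
# <PERSOON><VOORNAAM>Jan</VOORNAAM> Jansen</PERSOON> was seen
```

The example shows the nested case from the docstring: the `voornaam` mention sits inside the `persoon` mention, and the shortest-first ordering keeps its tags innermost.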