Improve PATIENT/PERSOON processing and more #20
Open: mkorvas wants to merge 41 commits into vmenger:main from mkorvas:main
Commits (41), all authored by matej-ibis-ai:
30ce936 Make MultiTok..Annotator notice changes in the trie
e12d0d0 Provide `LowercaseTail` string modifier
82aab5f Enable specifying the lang for titlecasing
156e201 Minimize data fixtures for tests
c7b4c89 Log annotated text after every processor
4459c14 Update documentation slightly
810b8b3 Expose `Document.token_lists` as a property
5002696 (Almost) automatically format code
7d2d866 Simplify `MultiTokenLookupAnnotator`...
762866a Update the `MultiTok...Annotator` docstring
1ae6846 Test user additions to the lookup trie
ae1f93e Test the `tokenizers` and `token_lists` props
d415f51 Remove and ignore the IDEA project file
d8e8ed3 Annotate docs for logging only if level is DEBUG
03fc99d Cosmetics
5d188cd Support whitespace trimming in `WordBoundaryTokenizer`
6ea9b74 Move `SequenceTokenizer` to Docdeid
4110a53 Format code
df73e54 Replace `_DIRECTION_MAP` with an enum
99163d6 Improve and test `annos_by_token()`
c7ba5bc Drop `Token.get_nth`, simplify `Token.iter_to`
c80e2ad Format code
40fcd62 Test and fix `Direction`
15b8648 Fix Flake8-reported errors
ebdefa4 Address most non-Mypy lint issues
4a082b8 Address easy and valid Mypy issues
3319df1 Add a test for keep_blanks=False in WBTokenizer
1afb16f Document how to run tests better + cosmetics
53db956 Drop the `Document.token_lists` property
230c507 Avoid "|" for union types
25cbcfd Move `annos_by_token` to `Document`
36eb1e3 Simplify `Direction.from_string`
573deff Rename `SequenceAnnotator.dicts` to `ds`
a2704c5 Replace `list(map(f, xs))` with list comprehension
3ca37aa Re-add `MultiTokenLookupAnnotator` accepting a `LookupSet`
68f4afb Add a test for matching multi-word phrases
fb3cbd8 Try to support multi-word matching in SequenceAnnotator
0c04a78 Give up multi-word matching in SequenceAnnotator
82c52fc Move seq pattern validation to a new method
9dcc4f0 Polish the code a little
659a694 Don't fail validation on refs to metadata
Log annotated text after every processor
commit c7b4c896f1ed3718a118d70cae5156d7b481b1ff
@@ -1,3 +1,5 @@
+from collections import defaultdict
+
 from frozendict import frozendict

 from docdeid.document import Document
@@ -32,3 +34,32 @@ def annotate_intext(doc: Document) -> str:
     )

     return text
+
+
+def annotate_doc(doc: Document) -> str:
+    """\
+    Adds XML-like markup for annotations into the text of a document.
+
+    Handles also nested mentions and in a way also overlapping mentions, even
+    though this kind of markup cannot really represent them.
+    """
+    annos_from_shortest = sorted(

[Review comment on the line above] You can do:

+        doc.annotations,
+        key=lambda anno: anno.end_char - anno.start_char)
+    idx_to_anno_starts = defaultdict(list)
+    idx_to_anno_ends = defaultdict(list)
+    for anno in annos_from_shortest:
+        idx_to_anno_starts[anno.start_char].append(anno)
+        idx_to_anno_ends[anno.end_char].append(anno)
+    markup_indices = sorted(set(idx_to_anno_starts).union(idx_to_anno_ends))
+    chunks = list()
+    last_idx = 0
+    for idx in markup_indices:
+        chunks.append(doc.text[last_idx:idx])
+        for ending_anno in idx_to_anno_ends[idx]:
+            chunks.append(f'</{ending_anno.tag.upper()}>')
+        for starting_anno in reversed(idx_to_anno_starts[idx]):
+            chunks.append(f'<{starting_anno.tag.upper()}>')
+        last_idx = idx
+    chunks.append(doc.text[last_idx:])
+    return ''.join(chunks)
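To make the nesting and overlap behaviour described in the docstring concrete, here is a minimal, self-contained sketch. It repeats the logic of `annotate_doc` from the diff, but `Anno` and `Doc` are hypothetical stand-ins for docdeid's annotation and `Document` classes (only the `tag`, `start_char`, `end_char`, `text` and `annotations` attributes used in the diff are assumed), and the `persoon`/`voornaam` tags and sample texts are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Anno:
    """Hypothetical stand-in for a docdeid annotation."""
    tag: str
    start_char: int
    end_char: int


@dataclass
class Doc:
    """Hypothetical stand-in for docdeid's Document."""
    text: str
    annotations: List[Anno]


def annotate_doc(doc: Doc) -> str:
    # Same logic as in the diff; shortest annotations come first.
    annos_from_shortest = sorted(
        doc.annotations, key=lambda anno: anno.end_char - anno.start_char
    )
    idx_to_anno_starts = defaultdict(list)
    idx_to_anno_ends = defaultdict(list)
    for anno in annos_from_shortest:
        idx_to_anno_starts[anno.start_char].append(anno)
        idx_to_anno_ends[anno.end_char].append(anno)
    markup_indices = sorted(set(idx_to_anno_starts).union(idx_to_anno_ends))
    chunks = []
    last_idx = 0
    for idx in markup_indices:
        chunks.append(doc.text[last_idx:idx])
        # Closing tags: shortest (innermost) annotation first.
        for ending_anno in idx_to_anno_ends[idx]:
            chunks.append(f"</{ending_anno.tag.upper()}>")
        # Opening tags: longest (outermost) annotation first.
        for starting_anno in reversed(idx_to_anno_starts[idx]):
            chunks.append(f"<{starting_anno.tag.upper()}>")
        last_idx = idx
    chunks.append(doc.text[last_idx:])
    return "".join(chunks)


# Nested mentions: "Jan" (4-7) inside "Jan Jansen" (4-14).
nested = Doc(
    "Dr. Jan Jansen called.",
    [Anno("persoon", 4, 14), Anno("voornaam", 4, 7)],
)
print(annotate_doc(nested))
# Dr. <PERSOON><VOORNAAM>Jan</VOORNAAM> Jansen</PERSOON> called.

# Overlapping mentions: the tags cross, so the result is not well-formed XML.
overlapping = Doc("0123456789", [Anno("a", 0, 5), Anno("b", 3, 8)])
print(annotate_doc(overlapping))
# <A>012<B>34</A>567</B>89
```

The second call shows what the docstring means by handling overlaps "in a way": every boundary is marked, but the output is only useful for reading or logging, not for parsing as XML.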
Review comment: OK, so this is like the function `annotate_intext` above, but a bit more elaborate? Should we just merge them? In any case, this needs a more descriptive name.
Reply: Yeah, I think that's the case. Let me replace the body of `annotate_intext` with a call to `annotate_doc`. Or would you rather deprecate the former function explicitly right away (perhaps using the `deprecation` package), or even simply remove it?
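For reference, the "delegate and deprecate" option could look roughly like the sketch below. It uses the standard library's `warnings` module rather than the `deprecation` package mentioned above, purely to stay dependency-free. The `annotate_doc` body here is a placeholder, not the real implementation, and the sketch assumes that callers of `annotate_intext` can accept the output of `annotate_doc`, which this thread suggests but does not confirm.

```python
import warnings
from types import SimpleNamespace


def annotate_doc(doc) -> str:
    """Placeholder for the function added in this commit."""
    return doc.text  # placeholder body; the real logic is in the diff


def annotate_intext(doc) -> str:
    """Deprecated wrapper kept for backward compatibility."""
    warnings.warn(
        "annotate_intext is deprecated; use annotate_doc instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return annotate_doc(doc)


# Usage: the old name still works but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = annotate_intext(SimpleNamespace(text="some text"))
print(result, [w.category.__name__ for w in caught])
```

Keeping the old name as a thin wrapper lets existing callers keep working for a release or two, whereas removing it outright is simpler but breaks them immediately.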