v1.0.0 #18

Merged
merged 45 commits into from
Dec 20, 2023
Commits
a684463
Rename tokenize module
vmenger Nov 30, 2023
4b02432
Add token index
vmenger Nov 30, 2023
6fb99e0
Add frozendict dependency
vmenger Dec 1, 2023
a178872
Speedup annotation sorting
vmenger Dec 1, 2023
2e01493
Speedup regexp annotator
vmenger Dec 1, 2023
903f540
Speedup single token annotator
vmenger Dec 1, 2023
949ec06
Fix test
vmenger Dec 1, 2023
1dae5f3
Speedup multi token lookup annotator
vmenger Dec 1, 2023
7e91a06
Add offset option
vmenger Dec 1, 2023
e66c038
Add repr for string processor
vmenger Dec 1, 2023
dde29cb
Move link token logic
vmenger Dec 1, 2023
ab5cc70
Add token lookup logic with matching pipeline
vmenger Dec 1, 2023
b0c493f
Update offset to start_i
vmenger Dec 1, 2023
608c3f5
Update previous and next tests
vmenger Dec 1, 2023
3ef03ee
Update get_words and token_lookup tests
vmenger Dec 1, 2023
79f6190
Formatting
vmenger Dec 1, 2023
eeffb20
Cleanup annotator code
vmenger Dec 1, 2023
280e7a8
Cleanup tokenizer code
vmenger Dec 1, 2023
f4d21ef
Formatting
vmenger Dec 1, 2023
fcc8873
Rename instance var
vmenger Dec 1, 2023
798ddfc
Add info when not presenting callbacks as frozendict
vmenger Dec 1, 2023
60f58b6
Update changelog
vmenger Dec 1, 2023
b43fc0f
Update dependencies
vmenger Dec 1, 2023
4079155
Optimize caching
vmenger Dec 4, 2023
bd65b0a
Optimize caching
vmenger Dec 4, 2023
b03108b
Add option to directly add trie to multi token lookup
vmenger Dec 5, 2023
61d6909
Update formatting and linting
vmenger Dec 7, 2023
51aea0f
Update formatting and linting
vmenger Dec 7, 2023
0a0624b
Linting
vmenger Dec 7, 2023
e4998ad
Formatting
vmenger Dec 8, 2023
a72ba4b
Improve pattern
vmenger Dec 8, 2023
fbdbd55
Improve processor and processor group abstraction
vmenger Dec 8, 2023
936a990
Rename files
vmenger Dec 8, 2023
8b486a7
Formatting
vmenger Dec 8, 2023
7945cb0
Order test classes
vmenger Dec 8, 2023
9587342
Move dev dependency
vmenger Dec 8, 2023
0a2b8f2
Update changelog
vmenger Dec 8, 2023
43c8241
Remove lcov file
vmenger Dec 8, 2023
46de845
Update docs
vmenger Dec 8, 2023
a9b9bd7
Update changelog
vmenger Dec 8, 2023
f3c15d2
Update token serializing
vmenger Dec 8, 2023
528a5b9
Rename pre_match_tokens internally
vmenger Dec 11, 2023
1263c37
Fix typo
vmenger Dec 12, 2023
dcb36bc
Use casefold instead of lower
vmenger Dec 13, 2023
a12775c
Prepare release
vmenger Dec 19, 2023
4 changes: 3 additions & 1 deletion .github/workflows/build.yml
@@ -2,6 +2,8 @@ name: build

on:
workflow_dispatch:
release:
types: [published]

jobs:

@@ -34,7 +36,7 @@ jobs:
poetry install --only dev

- name: Test build
run: make test
run: python -m pytest .

- name: Set up Pypi credentials
run: poetry config pypi-token.pypi ${{ SECRETS.PYPI_TOKEN }}
20 changes: 16 additions & 4 deletions .github/workflows/format.yml
@@ -30,8 +30,20 @@ jobs:
- name: Install dependencies
run: poetry install

- name: Check formatting
run: make format CHECK=1
- name: black
run: python -m black . --check

- name: Check linting
run: make lint CHECK=1
- name: isort
run: python -m isort . -c

- name: docformatter
run: python -m docformatter . --check

- name: flake8
run: python -m flake8 .

- name: pylint
run: python -m pylint docdeid/

- name: mypy
run: python -m mypy docdeid/
36 changes: 9 additions & 27 deletions .github/workflows/test.yml
@@ -36,33 +36,15 @@ jobs:
run: poetry install

- name: Test with pytest
run: make test
run: python -m pytest --cov-report xml

- name: Extract git branch name
id: git-branch-name
uses: EthanSK/git-branch-name-action@v1

- name: Coveralls parallel
uses: coverallsapp/github-action@1.1.3
- name: Code Coverage Summary Report
uses: irongut/CodeCoverageSummary@v1.3.0
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
flag-name: py-${{ matrix.python-version }}
git-branch: ${{ env.GIT_BRANCH_NAME }}
path-to-lcov: coverage.lcov
parallel: true

finalize-coverage:
needs: test
runs-on: ubuntu-latest
steps:

- name: Git branch name
id: git-branch-name
uses: EthanSK/git-branch-name-action@v1
filename: coverage.xml
badge: true
format: 'markdown'
output: 'both'

- name: Coveralls finish
uses: coverallsapp/github-action@1.1.3
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
git-branch: ${{ env.GIT_BRANCH_NAME }}
parallel-finished: true
- name: Write to Job Summary
run: cat code-coverage-results.md >> $GITHUB_STEP_SUMMARY
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,32 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 1.0.0 (2023-12-20)

### Added
* some internal speedups for `SingleTokenLookupAnnotator`, `MultiTokenLookupAnnotator` and `LookupTrie`
* caching for sorting annotations, which helps with speed
* the `pre_match_words` attribute for `RegexpAnnotator`
* the option to provide a `LookupTrie` to a `MultiTokenLookupAnnotator` directly
* methods on `TokenList` for getting all words and for looking up tokens with specific text values, with an optional `matching_pipeline`
* automated build/publish on merge to main
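
A rough sketch of why cached sorting tends to need hashable arguments. It assumes the cache behaves like Python's `functools.lru_cache`, which hashes its arguments to build a cache key; whether docdeid uses exactly this mechanism is an assumption. `Span` and `sorted_spans` are hypothetical names for illustration and are not part of docdeid's API.

```python
from functools import lru_cache
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation (not docdeid's Annotation)."""

    start_char: int
    end_char: int
    tag: str


@lru_cache(maxsize=None)
def sorted_spans(spans: frozenset, by: tuple) -> tuple:
    """Sort spans by the attribute names in `by`, caching per (spans, by)."""
    return tuple(sorted(spans, key=lambda s: tuple(getattr(s, name) for name in by)))


spans = frozenset({Span(10, 15, "name"), Span(0, 4, "date"), Span(0, 2, "name")})

first = sorted_spans(spans, by=("start_char", "end_char"))
second = sorted_spans(spans, by=("start_char", "end_char"))  # served from the cache

try:
    # A list cannot be hashed, so it cannot be part of a cache key.
    sorted_spans(spans, by=["start_char", "end_char"])
    list_key_accepted = True
except TypeError:
    list_key_accepted = False
```

The second call returns the cached tuple instead of sorting again, while passing the key as a `list` fails because lists are unhashable. The same reasoning would apply to a plain `dict` of callbacks versus an immutable, hashable mapping.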

### Changed
* sorting `Annotation` and `AnnotationSet` now requires the sort key to be provided as a `tuple`, and callbacks as a `frozendict`
* renamed `docdeid.tokenize` to `docdeid.tokenizer`
* renamed `docdeid.process.doc` to `docdeid.process.doc_processor`
* renamed `docdeid.process.annotation_set` to `docdeid.process.annotation_processor`
* `Annotation` and `Token` now only include `int`/`str` fields when serializing
* formatting and linting settings
* moved the logic for linking tokens to `TokenList` rather than `Tokenizer`
* use `casefold()` instead of `lower()` for lowercasing
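
To illustrate the last item, a small example of where `str.casefold()` and `str.lower()` differ in standard Python. The sample strings are chosen for illustration only and do not come from docdeid's code or tests.

```python
def matches_with_lower(a: str, b: str) -> bool:
    """Case-insensitive comparison using str.lower()."""
    return a.lower() == b.lower()


def matches_with_casefold(a: str, b: str) -> bool:
    """Case-insensitive comparison using str.casefold()."""
    return a.casefold() == b.casefold()


# The German sharp s is left unchanged by lower(), but casefold() maps it to "ss".
lower_match = matches_with_lower("Straße", "STRASSE")
casefold_match = matches_with_casefold("Straße", "STRASSE")

# For plain ASCII text the two approaches agree.
ascii_match = matches_with_lower("Jansen", "JANSEN") and matches_with_casefold("Jansen", "JANSEN")
```

Here `lower_match` is `False` and `casefold_match` is `True`, so lookups that normalize with `casefold()` can match spellings that `lower()` would treat as different.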

### Fixed
* a bug with overlapping annotations in `MultiTokenLookupAnnotator`

### Removed
* automated coverage reporting

## 0.1.10 (2023-11-28)

### Added
68 changes: 9 additions & 59 deletions Makefile
@@ -1,60 +1,12 @@
MAX_LINE_LENGTH := 120
CHECK ?= 0
format:
python -m black .
python -m isort .
python -m docformatter .

format_dirs := docdeid/ tests/
lint_dirs := docdeid/

black_args := --line-length $(MAX_LINE_LENGTH)
isort_args := --profile black
docformatter_args := --recursive --wrap-summaries $(MAX_LINE_LENGTH) --wrap-descriptions $(MAX_LINE_LENGTH) --pre-summary-newline

typehints_args := --select ANN001,ANN2,ANN3 --max-line-length $(MAX_LINE_LENGTH)
doclint_args := --disable=all --enable C0112,C0115,C0116
pylint_args := --disable=C0112,C0114,C0115,C0116 --max-line-length=$(MAX_LINE_LENGTH)
mypy_args :=

ifeq ($(CHECK), 1)
black_args += --check
isort_args += -c
docformatter_args := --check $(docformatter_args)
doclint_args += --fail-under 10.0
pylint_args += --fail-under 9.0

else
docformatter_args := --in-place $(docformatter_args)
typehints_args += --exit-zero
doclint_args += --exit-zero
pylint_args += --exit-zero

endif

format: black isort docformat

lint: typehints doclint pylint mypy

black:
python -m black $(black_args) $(format_dirs)

isort:
python -m isort $(isort_args) $(format_dirs)

docformat:
python -m docformatter $(docformatter_args) $(format_dirs)

typehints:
python -m flake8 $(typehints_args) $(lint_dirs)

doclint:
python -m pylint $(doclint_args) $(lint_dirs)

pylint:
python -m pylint $(pylint_args) $(lint_dirs)

mypy:
python -m mypy $(mypy_args) $(lint_dirs)

test:
python -m pytest --cov-report html --cov-report lcov --cov=docdeid --cov-fail-under=80 tests/
lint:
python -m flake8 .
python -m pylint docdeid/
python -m mypy docdeid/

build-docs:
sphinx-apidoc --module-first --force --templatedir=docs/templates -o docs/source/api docdeid
@@ -67,9 +19,7 @@ clean:
rm -rf .pytest_cache
rm -rf .mypy_cache
rm -rf dist

clean-docs:
rm -rf docs/_build
rm -rf docs/source/api

.PHONY: format lint black isort docformat typehints doclint pylint mypy test clean
.PHONY: format lint clean
1 change: 0 additions & 1 deletion README.md
@@ -1,7 +1,6 @@
# docdeid

[![tests](https://github.com/vmenger/docdeid/actions/workflows/test.yml/badge.svg)](https://github.com/vmenger/docdeid/actions/workflows/test.yml)
[![coverage](https://coveralls.io/repos/github/vmenger/docdeid/badge.svg)](https://coveralls.io/github/vmenger/docdeid)
[![build](https://github.com/vmenger/docdeid/actions/workflows/build.yml/badge.svg)](https://github.com/vmenger/docdeid/actions/workflows/build.yml)
[![Documentation Status](https://readthedocs.org/projects/docdeid/badge/?version=latest)](https://docdeid.readthedocs.io/en/latest/)
[![pypi version](https://img.shields.io/pypi/v/docdeid)](https://pypi.org/project/docdeid/)
2 changes: 1 addition & 1 deletion docdeid/__init__.py
@@ -7,4 +7,4 @@
from .deidentifier import DocDeid
from .document import Document, MetaData
from .pattern import TokenPattern
from .tokenize import Token, Tokenizer, TokenList
from .tokenizer import Token, Tokenizer, TokenList