ocrpostcorrection

In 2017 and 2019, a competition on Post-OCR Text Correction was organized. This repository contains the 'working' notebooks for reproducing the best results of the competition and possibly improving on them. The code in the notebooks uses functionality from the ocrpostcorrection package.

Install dependencies

    git clone https://github.com/jvdzwaan/ocrpostcorrection.git
    cd ocrpostcorrection
    pip install -e .
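
After the editable install, it can be useful to confirm that the package is visible to the Python environment the notebooks will run in. The following is a minimal sketch, not part of the repository: the helper `is_installed` is hypothetical, and the import name `ocrpostcorrection` is an assumption based on the package name.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the package is importable under the name `ocrpostcorrection`.
    name = "ocrpostcorrection"
    print(f"{name} importable: {is_installed(name)}")
```

If this reports that the package is not importable, the install most likely went into a different environment than the one running the check.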

How to use

This repository contains two sets of notebooks:

local: notebooks to be run locally, e.g., for generating datasets
colab: notebooks to be run on machines with a GPU, e.g., for training neural networks


    ocrpostcorrection
    ├── LICENSE
    ├── README.md
    ├── colab                                      <- Notebooks to be run on GPU
    │   ├── icdar-task1-hf-evaluation.ipynb        <- Evaluate Huggingface BERT model for error detection
    │   ├── icdar-task1-hf-train.ipynb             <- Train Huggingface BERT model for error detection
    │   ├── icdar-task2-seq2seq-evaluation.ipynb   <- Evaluate performance of error correction model
    │   └── icdar-task2-train-seq2seq.ipynb        <- Train error correction model
    └── local                                      <- Notebooks to be run locally
        ├── data                                   <- Data generated and/or used by local notebooks
        ├── evalTool_ICDAR2017.py                  <- ICDAR competition evaluation script
        ├── icdar-create-hf-dataset.ipynb          <- Create Huggingface dataset from the icdar data
        ├── icdar-task2-create-dataset.ipynb       <- Create error correction dataset from the icdar data
        ├── icdar-task2-results-analysis.ipynb     <- Preliminary analysis of error correction results
        └── perfect_task1+2_output_analysis.ipynb  <- Analysis of evalTool script for measuring performance
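
To see which notebooks belong to each set in a local checkout, the layout above can be inspected programmatically. This is a small sketch under the assumption that the checkout follows the tree shown; the function `notebooks_by_folder` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def notebooks_by_folder(repo_root, folders=("local", "colab")):
    """Map each folder name to the sorted notebook file names found directly in it."""
    root = Path(repo_root)
    return {
        # Only *.ipynb files are listed; scripts and data directories are skipped.
        folder: sorted(p.name for p in (root / folder).glob("*.ipynb"))
        for folder in folders
    }


if __name__ == "__main__":
    # Assumption: the script is run from the root of the checkout.
    for folder, names in notebooks_by_folder(".").items():
        print(f"{folder}: {len(names)} notebook(s)")
        for name in names:
            print(f"  {name}")
```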
