classify-split-extract-workflow

Document AI End-To-End Solution for Classification/Splitting/Extraction

Introduction

This solution aims to streamline document Classification/Splitting and Extraction, with all data being saved to BigQuery. The user can simply plug in their own custom-made Classifier/Splitter/Extractor by changing the configuration file (this can even be done in real time, since the file is stored in GCS) and specify the output BigQuery table for each processor.

As an example use case, the application is equipped to process an individual US Tax Return using the Lending Document AI Processors (out-of-the-box Specialized processors). However, you can (and should) train your own custom Splitter/Classifier and Extractor. You can then specify the fields (labels) to be extracted and saved to BigQuery in the format you need.

NOTE: The LDAI Splitter & Classifier used in this demo require allowlisting.
Read More about Lending DocAI

Architecture

Architecture

Pipeline Execution Steps

All environment variables referred to below are defined in the vars.sh file.

1. Pipeline Execution Trigger

  • The pipeline is triggered by uploading a document into the GCS bucket.
  • Note that a dedicated bucket, CLASSIFY_INPUT_BUCKET, is configured to send Pub/Sub notifications.
  • If the DATA_SYNC environment variable is set to true, any PDF document will trigger the pipeline as long as it is uploaded inside the CLASSIFY_INPUT_BUCKET.
  • Otherwise, only the START_PIPELINE file will trigger batch processing of all documents inside the uploaded folder.
  • Do not upload files into the splitter_output sub-folder. This is a system directory used to store split sub-documents and is therefore ignored.

2. Pub/Sub Event Forwarding

  • The Pub/Sub notification for the uploaded object is forwarded to trigger the GCP Workflow execution.

3. Workflow Execution

  • The workflow checks the uploaded file and triggers a Cloud Run Job for Document Classification and Splitting.

4. Classification/Splitting by the Cloud Run Job

  • The Classification Cloud Run Job performs the following tasks:
    • Uses Document AI Classifier or Splitter as defined in the config.json file (parser_config/classifier).
    • For each document sent for processing:
      • Determines confidence and type (Classifier).
      • Determines page boundaries and the type of each page (Splitter).
      • Performs the splitting into the splitter_output sub-folder.
    • The config.json file defines the relation between Classifier labels and Document Parsers to be used for those labels, as well as the output BigQuery table for each model.
    • Creates a JSON file inside the CLASSIFY_OUTPUT_BUCKET bucket. This file is the result of the classification/splitting job and is used for extraction.
      • The path to this JSON file is sent back to the GCP Workflow in the callback when the classification job is completed.
    • Creates the BigQuery mlops tables required by the ML.PROCESS_DOCUMENT function, such as:
      • Object tables for the GCS documents.
      • MODEL for the Document AI parsers.
    • Assigns GCS custom metadata to the input documents with the classification result (confidence score and document type). You can inspect it with gsutil, as shown after the JSON example below.
      • This metadata is then also saved to BigQuery.
      • Documents produced by splitting have metadata pointing to the original document.
    • Here is an example of the output JSON file, where:
      • object_table_name - the object table referencing all documents that were classified/split into the same document type.
      • model_name - the BigQuery MODEL corresponding to the Document AI Extractor.
      • out_table_name - the output BigQuery table name to be used for the extraction.
[
    {
        "object_table_name": "classify-extract-docai-01.mlops.GENERIC_FORM_DOCUMENTS_20240713_071014814006",
        "model_name": "classify-extract-docai-01.mlops.OCR_PARSER_MODEL",
        "out_table_name": "classify-extract-docai-01.processed_documents.GENERIC_FORMS"
    },
    {
        "object_table_name": "classify-extract-docai-01.mlops.MISC1099_FORM_DOCUMENTS_20240713_071014814006",
        "model_name": "classify-extract-docai-01.mlops.MISC1099_PARSER_MODEL",
        "out_table_name": "classify-extract-docai-01.processed_documents.MISC1099"
    }
]
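
To inspect the classification metadata attached to an input document (see the note above), you can use gsutil stat; the object name below is a placeholder:

# Custom metadata (document type, confidence score) appears under "Metadata:" in the output.
source vars.sh
gsutil stat gs://$CLASSIFY_INPUT_BUCKET/my-document.pdf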

5. Entity Extraction

  • Entity Extraction is done by the ML.PROCESS_DOCUMENT function as the next step of the GCP Workflow, and the results are saved to BigQuery.
    • The JSON file created by the classification job is used to run the Extraction.
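
For reference, this is roughly the kind of statement issued for each entry of the JSON file, shown here as a hedged sketch via the bq CLI. The model and object table names are taken from the example output above; the real workflow also writes the results into out_table_name, which is omitted here:

# Sketch only: run the extractor MODEL over the object table and preview the result.
bq query --use_legacy_sql=false '
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `classify-extract-docai-01.mlops.MISC1099_PARSER_MODEL`,
  TABLE `classify-extract-docai-01.mlops.MISC1099_FORM_DOCUMENTS_20240713_071014814006`)
LIMIT 10'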

6. Data Integration

  • (Optional, future): As the final step of the Workflow execution, extracted data can be sent downstream to a third party via an API call for further integration.

Google Cloud Products Used

  • Cloud Storage (GCS)
  • Pub/Sub
  • Workflows
  • Cloud Run
  • Document AI
  • BigQuery

Quotas

Default Quotas to be aware of:

  • Number of concurrent batch prediction requests - 10
  • Number of concurrent batch prediction requests processed using document processor (Single Region) per region - 5

BigQuery Tables

The BigQuery table schema is determined at runtime based on the Document AI parser used.

  • For the Generic Form parser and OCR parser, the schema does not contain any field-specific labels (only the extracted JSON and metadata). It is therefore flexible in usage, and all core information is within the JSON field ml_process_document_result (see the sample query after this list).
    • Therefore you can easily export data from both the OCR and Form parsers into the same BigQuery table.
  • For the Specialized Document parsers (like the W-2 Parser), fields are predefined for you.
  • For a user-defined Custom Document Extractor, the schema corresponds to the labels defined by the user and is fixed once the table is created (thus, if you need to make changes to the Extractor, you will need to either start using a new table or manually fix the schema).
    • In order to use the same BigQuery table for different Custom Document Extractors, they must use the same schema (and the same data types being extracted).
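
As a quick sanity check of what lands in such a table, here is a minimal query; the table name is taken from the example output above, and the column name is the one described in this section:

# Preview the raw Document AI output stored in the JSON result column.
bq query --use_legacy_sql=false '
SELECT ml_process_document_result
FROM `classify-extract-docai-01.processed_documents.GENERIC_FORMS`
LIMIT 5'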

Form/OCR Parser

Form/OCR Parser BigQuery Table Schema:

Generic Forms

Sample Extracted Data using Form/OCR parser: GENERIC FORMS DATA

Specialized Processors

Corresponding BigQuery Table Schema:

  • TODO

Sample of the extracted data:

  • TODO

Custom Document Extractor

User-defined Labels in the DocumentAI console:

PA Forms

Corresponding BigQuery Table Schema:

PA Forms Data

Setup

Preparation

The goal is to be able to redirect each document in real time to the appropriate Document Extractor.

As a preparation, the user needs to:

  • Define which document types are expected and which document extractors are needed.
  • Train a classifier or splitter that can predict the document class (and, optionally, identify document boundaries).
  • Deploy (and, if needed, train) the required document extractors.

Dependencies

  1. Install Python
  2. Install the Google Cloud SDK
  3. Run gcloud init, create a new project, and enable billing
  4. Set up application default credentials by running:
  • gcloud auth application-default login

Deployment

Environment Variables

  • Create a new GCP Project:
export PROJECT_ID=...
gcloud config set project $PROJECT_ID
  • If you want to make use of existing Document AI processors located in another project, set the environment variable for the project where the processors are located. Otherwise, skip this step.
  • This is needed to set up the proper access rights.
export DOCAI_PROJECT_ID=...

Infrastructure Setup

  • Run the infrastructure setup script in the newly created project:
./setup.sh

Processors

  • If you do not have any specific documents and do not want to do Document Training, you can make use of the ready-to-use specialized processors:
    • LDAI Splitter & Classifier - requires allowlisting (usually takes one business day)
    • W-2 Parser
    • 1099 Parser(s)

The following script will create the Document AI processors listed above and update the config.json file for you:

./create_demo_rpocessors.sh

BigQuery Reservations

  • Create BigQuery Reservations: Before working with ML.PROCESS_DOCUMENT, you’ll need to turn on BigQuery Editions using the Reservations functionality. You’ll need BigQuery Enterprise with the minimum number of reservable slots (100) and no baseline. This is done through BigQuery > Administration > Capacity management (a CLI sketch follows this list):
    • Create Reservation
    • Make sure to assign QUERY actions to this reservation: Click on the newly created reservation -> ASSIGNMENTS -> CREATE ASSIGNMENT of type QUERY inside your project
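
If you prefer the CLI over the console steps above, roughly the following should work; the reservation name is a placeholder, and the edition/autoscaling flag names are assumptions that may vary by bq CLI version:

# Sketch only: create an Enterprise reservation (location must match your BigQuery datasets;
# --edition and --autoscale_max_slots are assumptions and may differ across bq versions).
bq mk --project_id=$PROJECT_ID --location=US --reservation \
  --edition=ENTERPRISE --autoscale_max_slots=100 docai-demo-reservation
# Assign QUERY jobs in this project to the reservation.
bq mk --project_id=$PROJECT_ID --location=US --reservation_assignment \
  --reservation_id=$PROJECT_ID:US.docai-demo-reservation \
  --job_type=QUERY --assignee_type=PROJECT --assignee_id=$PROJECT_ID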

Configuration

Here is an explanation of the structure of config.json, which defines the processors used in the pipeline:

parser_config:

  • Contains an arbitrary number of document extractors and a single classifier:
    • The document extractors:
    • The document Classifier/Splitter:
      • Depending on your needs, you should train either a Classifier or a Splitter.
        • A Classifier identifies the class of a document from a user-defined set of classes and returns a classification label along with a confidence score.
        • A Splitter predicts which pages make up the various documents within a composite file and the class of each identified document.
      • The name is reserved and has to be classifier in the config.json.
  • Each processor is described by a dictionary, keyed by a name that is later referenced in document_types_config, with the following fields:
    • processor_id - the full path to the processor,
    • out_table_name - the name of the output BigQuery table to which the data is saved.

document_types_config:

  • Contains a list of the supported document classes. Each class (or type) is described as a dictionary with the following fields:
    • classifier_label - the classification label as trained by the Classifier/Splitter,
    • parser - the name of the document processor to be used for extracting the data.

settings_config:

  • Currently you can specify the classification confidence threshold (classification_confidence_threshold) and the default document type (classification_default_class) returned by the Classification Job when no Classifier is defined or when the returned classification falls below the confidence threshold.
  • Modify the config.json file to match your needs, or leave it as is for the tax demo (an illustrative sketch is shown below).

  • Copy file to GCS:

    source vars.sh
    gsutil cp classify-job/config/config.json gs://$CONFIG_BUCKET/
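
For orientation, here is an illustrative sketch of a minimal config.json following the structure described above, written via a heredoc. All processor IDs, labels, thresholds and table names are placeholders, and the exact field layout of the shipped demo config may differ:

# Sketch only: placeholder values, not the demo configuration.
cat > classify-job/config/config.json <<'EOF'
{
  "parser_config": {
    "classifier": {
      "processor_id": "projects/PROJECT_ID/locations/us/processors/CLASSIFIER_ID"
    },
    "w2_parser": {
      "processor_id": "projects/PROJECT_ID/locations/us/processors/W2_PARSER_ID",
      "out_table_name": "processed_documents.W2"
    }
  },
  "document_types_config": [
    { "classifier_label": "w2", "parser": "w2_parser" }
  ],
  "settings_config": {
    "classification_confidence_threshold": 0.7,
    "classification_default_class": "generic_form"
  }
}
EOF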

Running the Pipeline

Out-of-the-box demo

If you followed the LDAI Splitter & Classifier steps, you can try the single combined tax document:

  source vars.sh
  gsutil cp sample-docs/taxes-combined.pdf gs://$CLASSIFY_INPUT_BUCKET/
  • Go to Workflows and check the execution status (or use gcloud, as shown after this list)

  • Go to Cloud Run jobs and check that the Job was triggered by the Workflow

  • When the Job is completed, the workflow will continue with extraction

  • Check the BigQuery processed_documents dataset. It should have four tables created and filled with the extracted data:

    • W2
    • GENERIC_FORMS (because we have not created a processor for the NEC form type)
    • MISC1099
    • INT1099
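
If you prefer the command line for these checks, roughly the following can be used; the workflow name, job name and region are placeholders that depend on your deployment (see vars.sh and setup.sh):

# Placeholders: substitute the workflow/job names and region used by your deployment.
gcloud workflows executions list classify-extract-workflow --location=us-central1 --limit=5
gcloud run jobs executions list --job=classify-job --region=us-central1
bq ls processed_documents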

Using live data updates

Data sync is enabled/disabled with the DATA_SYNCH environment variable defined in vars.sh.

  • When DATA_SYNCH is on, each document uploaded to the input bucket (CLASSIFY_INPUT_BUCKET) will trigger the pipeline execution.
  • When DATA_SYNCH is off, only uploading a file named START_PIPELINE will trigger the pipeline execution, and all files in that directory will be processed (batch mode).

To trigger single document processing:

  • Modify vars.sh and set DATA_SYNCH to true
  • Redeploy:
./deploy.sh
  • Upload a PDF document into the CLASSIFY_INPUT_BUCKET bucket (defined in vars.sh)
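
For example (the document name below is a placeholder):

source vars.sh
gsutil cp my-document.pdf gs://$CLASSIFY_INPUT_BUCKET/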

Be mindful of quotas (5 concurrent API requests): each uploaded file triggers a separate Pub/Sub event, so uploading more than five files at once can easily hit the quota limit.

Running the Batch

In order to trigger batch document processing, upload an empty file named START_PIPELINE into the configured trigger input bucket (CLASSIFY_INPUT_BUCKET). All .pdf files in that folder will be processed.

source vars.sh
gsutil cp START_PIPELINE gs://"$CLASSIFY_INPUT_BUCKET"/

Next Steps

  • Offer an out-of-the-box demo using the Specialized Classifier/Extractors and the OCR/Form parsers
  • Dealing with DocAI Quotas (5 concurrent jobs to classifier/docai processor).
  • Dealing with ML Limitations - The function can't process documents with more than 15 pages. Any row that contains such a file returns an error.
    • Split documents with more than 15 pages and support retrieving data across the split documents.
  • Add functionality to support document splitting using Splitter:
    • Use GCS custom metadata to provide additional information/context along with the data into the BigQuery (such as original file that was split).
  • Convert setup bash scripts to terraform
  • Use confidence threshold to mark document for Human Review before proceeding with Extraction.
  • Use confidence score of the extracted data to mark document for Human Review.
  • Add support for other MIME types, such as JPEG, PNG, etc. (currently only PDF documents are supported)
  • Use GCS custom metadata to provide additional information/context along with the data in BigQuery.
    • Could be used, for example, to store information about the classification/splitting job or its output.
  • Add UI + Firebase for HITL and Access to the data.
  • Fix the logging handler for the Cloud Run Job (requires more setup to set proper levels); otherwise all logging appears at the Default level in Cloud Run

References

Similar Demos


Copyright 2024 Google LLC Author: Eva Khmelinskaya