This repository builds an end-to-end pipeline on GCP that processes expenses (i.e., receipts) with the Document AI API. It serves as sample code for building your own demo and has not been tested for production use.
1. Create a Google Cloud Platform project.
2. Enable the Cloud Document AI API, Cloud Functions API, and Cloud Build API in the project you created in Step 1.
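If you prefer the command line to the console, the same three APIs can be enabled with `gcloud services enable`. The sketch below only assembles the command so that it can be reviewed before it is run. The helper name `enable_apis_command` and the project ID `my-project` are placeholders introduced here, not part of this repository.

```python
# Service names for the three APIs listed in this step.
REQUIRED_SERVICES = [
    "documentai.googleapis.com",      # Cloud Document AI API
    "cloudfunctions.googleapis.com",  # Cloud Functions API
    "cloudbuild.googleapis.com",      # Cloud Build API
]


def enable_apis_command(project_id: str) -> list[str]:
    """Return the gcloud argv that enables the required APIs (hypothetical helper)."""
    if not project_id:
        raise ValueError("project_id must not be empty")
    return ["gcloud", "services", "enable", *REQUIRED_SERVICES, "--project", project_id]


if __name__ == "__main__":
    # "my-project" is a placeholder; substitute your own project ID.
    # Print the command for review. To execute it, pass the list to
    # subprocess.run(..., check=True) in an environment where gcloud is installed.
    print(" ".join(enable_apis_command("my-project")))
```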
3. If you do not have access to the parser, request access via this link. Here is a link to the official Expense Parser documentation.
4. Create a service account that will later be used by Cloud Functions:
- Navigate to IAM & Admin -> Service Accounts
- Click on Create a service account
- In the Service account name section, type in process-receipt-example or a name of your choice
- Click Create and continue
- Grant this service account the following roles:
  - Storage Admin
  - BigQuery Admin
  - Document AI API User
- Click Done. The service account should now appear on the IAM main page
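For reference, the console role names above correspond to IAM role IDs, and the service account is addressed by an email-style member string. The following is a minimal sketch of that mapping, assuming the service account ID is the same as the name you typed. The helper names and the project ID `my-project` are hypothetical, and the generated `gcloud projects add-iam-policy-binding` commands are an alternative to the console clicks rather than something the setup script is known to run.

```python
# Console role names from this step mapped to their IAM role IDs.
ROLE_IDS = {
    "Storage Admin": "roles/storage.admin",
    "BigQuery Admin": "roles/bigquery.admin",
    "Document AI API User": "roles/documentai.apiUser",
}


def service_account_member(name: str, project_id: str) -> str:
    """Build the IAM member string for a user-managed service account."""
    return f"serviceAccount:{name}@{project_id}.iam.gserviceaccount.com"


def role_binding_commands(name: str, project_id: str) -> list[list[str]]:
    """Return one gcloud argv per role to grant (hypothetical helper)."""
    member = service_account_member(name, project_id)
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role_id}"]
        for role_id in ROLE_IDS.values()
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    for argv in role_binding_commands("process-receipt-example", "my-project"):
        print(" ".join(argv))
```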
5. Create your Document AI processor:
- At this point, your request from Step 3 should be approved and you should have access to the Expense Parser
- Navigate to Console -> Document AI -> Processors
- Click Create processor and choose Expense Parser
- Name your processor and click Create
- Take note of your processor's region (e.g., us) and processor ID
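The region and processor ID matter because Document AI addresses a processor by a full resource name and serves it from a regional endpoint. A minimal sketch of how the two values combine is shown below. The function names and the sample values `my-project` and `abc123` are placeholders, and how this repository's Cloud Function builds these strings is not shown in this README.

```python
def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full resource name that identifies a Document AI processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Regional Document AI endpoint for the given processor location."""
    return f"{location}-documentai.googleapis.com"


if __name__ == "__main__":
    # Placeholder values; use the region and ID you noted in this step.
    print(processor_name("my-project", "us", "abc123"))
    print(api_endpoint("us"))
    # With the google-cloud-documentai package installed, the endpoint would
    # typically be passed as client_options={"api_endpoint": ...} when creating
    # a DocumentProcessorServiceClient, and the name used in the process request.
```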
6. Activate Cloud Shell and clone this GitHub repository using the command:
gh repo clone GoogleCloudPlatform/document-ai-samples
Note: If you are using your local terminal, follow this link to install and initialize the Google Cloud CLI before this step.
7. Run the Bash scripts in your Cloud Shell terminal to create the cloud resources (i.e., Google Cloud Storage buckets, Pub/Sub topics, Cloud Functions, and a BigQuery dataset and table):
- Change directory to the scripts folder inside the cloned repository
cd document-ai-samples/community/expense-parser-python
- Update the following values in .env.local:
  - PROJECT_ID should match your current project's ID
  - BUCKET_LOCATION is where you want the raw receipts to be stored
  - CLOUD_FUNCTION_LOCATION is where your code executes
  - CLOUD_FUNCTION_SERVICE_ACCOUNT should be the name of the service account you created in Step 4
vim .env.local
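Before running the setup script, it can help to confirm that all four values are filled in. The sketch below is one way to check, assuming .env.local uses plain `KEY=value` lines (optionally prefixed with `export`), which this README does not spell out. `parse_env` and `missing_keys` are hypothetical helpers, not part of the repository.

```python
REQUIRED_KEYS = (
    "PROJECT_ID",
    "BUCKET_LOCATION",
    "CLOUD_FUNCTION_LOCATION",
    "CLOUD_FUNCTION_SERVICE_ACCOUNT",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("export "):
            line = line[len("export "):].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "us" and us are treated the same.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    with open(".env.local", encoding="utf-8") as handle:
        missing = missing_keys(parse_env(handle.read()))
    print("Missing values: " + ", ".join(missing) if missing else "All values set")
```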
- Make the setup script executable
chmod +x set-up-pipeline.sh
- Change directory to the cloud functions folder
cd cloud-functions
- Update the following values in .env.yaml (from your notes in Step 5):
  - PARSER_LOCATION
  - PROCESSOR_ID
vim .env.yaml
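These two values are presumably handed to the Cloud Function as environment variables at deploy time (for example through gcloud's `--env-vars-file` flag, although this README does not confirm how the setup script deploys). Under that assumption, a function could read and validate them roughly as follows. `load_parser_settings` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def load_parser_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the processor settings from environment variables, failing early if unset."""
    env = os.environ if env is None else env
    missing = [key for key in ("PARSER_LOCATION", "PROCESSOR_ID") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"location": env["PARSER_LOCATION"], "processor_id": env["PROCESSOR_ID"]}


if __name__ == "__main__":
    # Placeholder values standing in for the contents of .env.yaml.
    print(load_parser_settings({"PARSER_LOCATION": "us", "PROCESSOR_ID": "abc123"}))
```

Failing at startup with the names of the missing variables tends to be easier to debug than a later API error caused by an empty processor ID.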
- Go back to the original folder and run the setup script to create the cloud resources
cd ..
./set-up-pipeline.sh
8. Test and validate the demo:
- Upload a sample receipt to the input bucket (<project_id>-input-receipts)
- When processing finishes, your BigQuery tables should be populated with the extracted entities (e.g., total_amount, supplier_name). Note that each row is an extracted entity, not a document.
- With the structured data in BigQuery, you can design downstream analytical tools to gain actionable insights and to detect errors or fraud.
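To illustrate the one-row-per-entity layout, the sketch below flattens the entities of a single receipt into rows. The `Entity` class is a stand-in whose field names mirror the Document AI Python client (`type_`, `mention_text`, `confidence`). The column names and sample values are assumptions made for illustration and may differ from the actual BigQuery schema that this repository creates.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Entity:
    """Stand-in for a Document AI entity (illustrative, not the client class)."""
    type_: str
    mention_text: str
    confidence: float


def entities_to_rows(file_name: str, entities: Iterable[Entity]) -> list[dict]:
    """Produce one row per extracted entity; column names are hypothetical."""
    return [
        {
            "input_file_name": file_name,
            "type": entity.type_,
            "text": entity.mention_text,
            "confidence": entity.confidence,
        }
        for entity in entities
    ]


if __name__ == "__main__":
    # Made-up sample values for a single receipt.
    sample = [
        Entity("total_amount", "12.50", 0.98),
        Entity("supplier_name", "Acme Cafe", 0.91),
    ]
    for row in entities_to_rows("receipt.pdf", sample):
        print(row)
```

Because one receipt yields several rows, queries that need per-document values would group or pivot on the file name column.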
This community sample is not officially maintained by Google.