Azure Batch OCR job template

This sample shows how to use ghostscript and tesseract-ocr to transform PDF files into plain text files (.txt). It does this in two stages:

  1. Use ghostscript to convert a PDF to a set of PNG files (one for each page of the PDF).
  2. Use tesseract-ocr to convert the PNG images into plain text files (.txt).
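The two stages can be tried locally on a single PDF before running them at scale. The following is a minimal sketch, not the command line used by job.json: the function name ocr_pdf, the png16m device, the 300 dpi resolution, and the output file naming are all assumptions chosen for illustration, and it assumes gs and tesseract are available on the PATH.

```shell
# Hypothetical helper (not part of the sample): OCR one PDF locally.
# Usage: ocr_pdf <input.pdf> <output-dir>
ocr_pdf() {
  local pdf="$1" out_dir="$2"
  local base
  base="$(basename "$pdf" .pdf)"
  mkdir -p "$out_dir" || return 1

  # Stage 1: render each page of the PDF to its own PNG (numbered from 001).
  # The png16m device and 300 dpi are illustrative choices.
  gs -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=png16m -r300 \
     -sOutputFile="$out_dir/$base-%03d.png" "$pdf" || return 1

  # Stage 2: OCR each PNG; tesseract appends .txt to the output base name.
  local png
  for png in "$out_dir/$base"-*.png; do
    [ -e "$png" ] || continue   # nothing was rendered
    tesseract "$png" "${png%.png}" || return 1
  done
}
```

For example, `ocr_pdf report.pdf out` would be expected to leave out/report-001.png, out/report-001.txt, and so on, one pair per page.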

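Features used by this sample

This sample uses pool and job templates with parameter files, a parametricSweep task factory, input upload through file groups, and automatic upload of task outputs to an Azure Storage container.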

Prerequisites

You must have an Azure Batch account set up with a linked Azure Storage account.

Create a pool

To create your pool:

az batch pool create --template pool.json

The default settings in pool.json specify a pool named ocrpool containing 3 STANDARD_D1_V2 virtual machines.

To override the default values, create a JSON file that supplies the parameters of your pool. If you have a large number of files to convert, use a larger pool or bigger VMs in the pool.

To create the pool with your own settings, run:

az batch pool create --template pool.json --parameters <your settings JSON file>

You are billed for your Azure Batch pools, so don't forget to delete this pool through the Azure portal when you're done.
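The pool can also be deleted from the command line instead of the portal. The sketch below wraps az batch pool delete in a hypothetical helper named delete_pool, which is not part of the sample. It defaults to the pool id ocrpool from pool.json and assumes the CLI is already authenticated against your Batch account.

```shell
# Hypothetical helper: delete the OCR pool (defaults to the sample's pool id).
# Usage: delete_pool [pool-id]
delete_pool() {
  local pool_id="${1:-ocrpool}"
  # az asks for confirmation before deleting the pool.
  az batch pool delete --pool-id "$pool_id"
}
```

Call `delete_pool` with no arguments for the default pool, or pass the id you used in your own pool parameters.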

Upload files

To upload your PDF files:

az batch file upload --local-path <path> --file-group <group>

Set <path> to the folder containing the PDF files you want to process. The <group> name is what you supply later as the inputFileGroup parameter.

Create a job and tasks

Edit the job.parameters.json file to supply parameters to the template. If you want to configure other options of the job, such as the pool id, look in the parameters section of job.json to see which options are available.

| Parameter | Required | Description |
| --- | --- | --- |
| jobId | Mandatory | The id of the Azure Batch job. |
| poolId | Optional | The id of the Azure Batch pool to run on. Must match the id of the pool you created earlier. Default value if not otherwise specified: ocrpool |
| inputFileGroup | Mandatory | The file group containing the input files. Must match the name of the file group used by your az batch file upload command earlier. |
| outputFileStorageUrl | Mandatory | A storage SAS URL to a container with write access. A general SAS URL to blob storage will not work. |
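A parameters file covering the four parameters described above might look like the following sketch. It assumes the ARM-style layout in which each parameter is an object with a value field. The file name my-job.parameters.json, the job id ocr-job, and the file group name ocr-input are illustrative, and the SAS URL is left as a placeholder. Treat the job.parameters.json shipped with the sample as the authoritative shape.

```shell
# Write an illustrative parameters file (file name and all values are assumptions).
cat > my-job.parameters.json <<'EOF'
{
  "jobId": { "value": "ocr-job" },
  "poolId": { "value": "ocrpool" },
  "inputFileGroup": { "value": "ocr-input" },
  "outputFileStorageUrl": { "value": "<container SAS URL with write access>" }
}
EOF
```

Writing to a separate file keeps the sample's own job.parameters.json untouched. Pass whichever file you edit to the --parameters option.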

Run the job

To create your job and tasks:

az batch job create --template job.json --parameters job.parameters.json

As each task completes, its outputs are uploaded to the Azure Storage container you specified in outputFileStorageUrl. The target container will contain a new virtual directory for each task that ran.

Monitor the job

You can use this command to monitor the tasks in the job and their progress:

az batch task list --job-id <jobid>

You can also use the Azure portal or Batch Explorer for monitoring.
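For a quick progress summary, the task list output can be reduced to a count of tasks per state. This is a minimal sketch using the CLI's global --query and -o options. The helper name task_state_counts is hypothetical, and it assumes the CLI is already authenticated against your Batch account.

```shell
# Hypothetical helper: print how many tasks of a job are in each state.
# Usage: task_state_counts <jobid>
task_state_counts() {
  local job_id="$1"
  # --query takes a JMESPath expression; -o tsv prints one state per line.
  az batch task list --job-id "$job_id" --query "[].state" -o tsv \
    | sort | uniq -c
}
```

Each output line is a count followed by a task state such as active, running, or completed. The job is finished when every task is counted under completed.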

Structure of the sample

| File | Content |
| --- | --- |
| pool.json | A template for creating the pool required for OCR processing. |
| job.json | A template for the job to run, including parameter definitions and a parametricSweep task factory. |
| job.parameters.json | Provides values for the parameters defined in job.json. |