This sample shows how to use ghostscript and tesseract-ocr to transform PDF files into plain text files (.txt). It does this in two stages:
- Use ghostscript to convert a PDF to a set of PNG files (one for each page of the PDF).
- Use tesseract-ocr to convert the PNG images into plain text files (.txt).
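
The two stages can be sketched locally as follows. This is a minimal illustration, not the command line the job template actually runs. It assumes the gs and tesseract executables are on the PATH; the png16m device, the 300 dpi resolution, and the page-NNN file names are illustrative choices that do not come from this sample.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, output_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG.

    Ghostscript fills in "%03d" with the page number, so a 3-page PDF
    yields page-001.png, page-002.png and page-003.png.
    """
    pattern = str(Path(output_dir) / "page-%03d.png")
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def tesseract_command(png_path):
    """Build a Tesseract command; Tesseract appends ".txt" to the output base."""
    png_path = Path(png_path)
    output_base = png_path.with_suffix("")  # page-001.png -> page-001
    return ["tesseract", str(png_path), str(output_base)]


def pdf_to_text(pdf_path, output_dir):
    """Run both stages for one PDF and return the expected .txt paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Stage 1: PDF -> one PNG per page.
    subprocess.run(ghostscript_command(pdf_path, output_dir), check=True)
    text_files = []
    # Stage 2: each PNG -> one plain text file.
    for png in sorted(output_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(png), check=True)
        text_files.append(png.with_suffix(".txt"))
    return text_files
```

Calling pdf_to_text("scan.pdf", "out") on a machine with both tools installed should leave one .txt file per page in the out folder.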

This sample uses the following features:
- Pool and job templates with parameterization
- Parametric sweep task factory
- Automatic persistence of task output files to Azure Storage
- Easy software installation via package managers
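
To make the parametric sweep idea concrete, the sketch below models how a sweep over a numeric range could expand into one command line per task. It is a simplified model rather than the service's implementation: the {0} placeholder, the inclusive end value, and the process.sh script name are assumptions made for illustration.

```python
def expand_parametric_sweep(command_template, start, end, step=1):
    """Return one command line per sweep value, substituting "{0}".

    The range is treated as inclusive of `end`, which is an assumption
    of this sketch.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    stop = end + 1 if step > 0 else end - 1
    return [
        command_template.replace("{0}", str(value))
        for value in range(start, stop, step)
    ]


# Hypothetical command template: one task per numbered input file.
commands = expand_parametric_sweep("process.sh input-{0}.pdf", start=1, end=3)
for line in commands:
    print(line)
```

Each generated command line corresponds to one task, so a sweep from 1 to 3 produces three independent tasks that can run in parallel across the pool.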
You must have an Azure Batch account set up with a linked Azure Storage account.
To create your pool:
az batch pool create --template pool.json
The default settings in pool.json specify a pool named ocrpool containing 3 STANDARD_D1_V2 virtual machines.
To change the default pool settings, create a JSON file that supplies your own parameter values. If you have a large number of files to convert, use a larger pool or bigger VMs in the pool.
To create the pool with your own configuration, run:
az batch pool create --template pool.json --parameters <your settings JSON file>
You are billed for your Azure Batch pools, so don't forget to delete this pool through the Azure portal when you're done.
To upload your PDF files:
az batch file upload --local-path <path> --file-group <group>
Run this command on a folder containing the PDF files you want to process.
Edit the job.parameters.json file to supply parameters to the template. If you want to configure other options of the job, such as the pool id, look in the parameters section of job.json to see what options are available.
Parameter | Required | Description |
---|---|---|
jobId | Mandatory | The id of the Azure Batch job. |
poolId | Optional | The id of the Azure Batch pool to run on. Must match the id of the pool you created earlier. Default value if not otherwise specified: ocrpool |
inputFileGroup | Mandatory | The file group containing the input files. Must match the name of the file group used by your az batch file upload command earlier. |
outputFileStorageUrl | Mandatory | A storage SAS URL to a container with write access. A general SAS URL to blob storage will not work. |
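
As a rough illustration of the table above, the following sketch builds a parameters file and performs a heuristic check on the SAS URL. Several things here are assumptions to verify against the job.parameters.json shipped with the sample: the {"value": ...} wrapper around each parameter, and the idea that a container-scoped SAS can be recognized by sr=c with a w in the sp permissions. The example URL, job id, and file group name are placeholders.

```python
import json
from urllib.parse import urlparse, parse_qs


def looks_like_writable_container_sas(url):
    """Heuristic: SAS scoped to a container (sr=c) with write (w) permission."""
    query = parse_qs(urlparse(url).query)
    resource = query.get("sr", [""])[0]
    permissions = query.get("sp", [""])[0]
    return resource == "c" and "w" in permissions


def build_job_parameters(job_id, input_file_group, output_sas_url, pool_id=None):
    """Build the parameter mapping; poolId is omitted so the default applies."""
    if not looks_like_writable_container_sas(output_sas_url):
        raise ValueError(
            "outputFileStorageUrl should be a container SAS URL with write access"
        )
    # Assumed file layout: each parameter wrapped in a {"value": ...} object.
    params = {
        "jobId": {"value": job_id},
        "inputFileGroup": {"value": input_file_group},
        "outputFileStorageUrl": {"value": output_sas_url},
    }
    if pool_id is not None:
        params["poolId"] = {"value": pool_id}
    return params


def write_job_parameters(path, params):
    """Write the parameter mapping as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=4)


# Placeholder values for illustration only.
example_url = "https://example.blob.core.windows.net/output?sv=2020-08-04&sr=c&sp=rw&sig=placeholder"
example_params = build_job_parameters("ocrjob", "ocr-input", example_url)
```

The check is deliberately conservative: it rejects URLs without a container-scoped sr field before the job is submitted, rather than letting tasks fail later when they try to upload their output.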
To create your job and tasks:
az batch job create --template job.json --parameters job.parameters.json
As each task completes, its outputs are uploaded to the Azure Storage container you specified. The target container will contain a new virtual directory for each task that ran.
You can use this command to monitor the tasks in the job and their progress:
az batch task list --job-id <jobid>
You can also use the Azure portal or Batch Explorer for monitoring.
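
For a quick progress summary, the JSON printed by the az batch task list command can be condensed into a count per task state. The sketch below assumes the CLI is producing JSON output and that each task object carries a state field; the sample data is hypothetical and heavily trimmed.

```python
import json
from collections import Counter


def summarize_task_states(task_list_json):
    """Count tasks per state from the JSON printed by `az batch task list`."""
    tasks = json.loads(task_list_json)
    return Counter(task.get("state", "unknown") for task in tasks)


# Hypothetical, trimmed output used for illustration only.
sample_output = (
    '[{"id": "0", "state": "completed"},'
    ' {"id": "1", "state": "running"},'
    ' {"id": "2", "state": "completed"}]'
)
summary = summarize_task_states(sample_output)
print(dict(summary))
```

In practice the input would come from the CLI itself, for example by capturing the output of az batch task list --job-id <jobid> --output json and passing it to summarize_task_states.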
File | Content |
---|---|
pool.json | A template for creating the pool required for OCR processing. |
job.json | A template for the job to run, including parameter definitions and a parametricSweep task factory. |
job.parameters.json | Provides values for the parameters defined in job.json. |