Workers
SmartDocumentor implements workers to read, process and review documents. It is possible to adjust settings of the pre-built plugins or create new plugins to interact differently with SmartDocumentor. Here we show how to configure:
- Custom Worker
- Process Invoice
- Folder Monitor
- Pop3 Monitor
- Document Publisher
- Router Worker
- Move Task To History
Custom Worker
When creating a plugin, you can implement custom workers. To create a new worker, the class must first inherit from BaseWorker:
public class GenericOutputWorker : BaseWorker
{
}
In addition, two methods must be overridden: InitializeWorkerMain and ProcessItem.
The following example shows a simple worker from GenericPlugin that reads the extracted fields from a source, processes them and writes them into the task.
protected override void InitializeWorkerMain()
{
    // Read the optional ForceExtractedFieldList setting and fail fast on an invalid value.
    if (WorkerSettings.TryGetValue("ForceExtractedFieldList", out string forceExtractedFieldListSetting))
    {
        if (!bool.TryParse(forceExtractedFieldListSetting, out forceExtractedFieldList))
        {
            throw new ArgumentException($"Invalid argument 'ForceExtractedFieldList'. Value: '{forceExtractedFieldListSetting}'.");
        }
    }
}

public override void ProcessItem(SDTask task)
{
    this.ExtractedFieldList = new List<ExtractedField>();
    this.OcrJobResultList = task.GetOrc();
    this.OcrJobResultByPageNumber = task.GetOrcByPageNumber();
    this.InvoiceData = task.GetInvoiceData();
    if (this.InvoiceData != null)
    {
        foreach (ExtractedField item in this.InvoiceData.Fields)
        {
            // Write the field when the task has no value for it yet (or when forced),
            // and only if the extracted text is not empty.
            if ((string.IsNullOrEmpty(task.GetPropertyValue(item.FieldName)) || forceExtractedFieldList) &&
                !string.IsNullOrEmpty(item.Entity?.Text))
            {
                this.AddOrReplaceExtractedField(item);
                task.SetPropertyValue(item.FieldName, item.Entity.GetEntityTextValue(item.FieldName, Editors.CaptureFieldDataType.String));
            }
        }
    }
    this.SetTaskData(task);
}
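The validation pattern used in InitializeWorkerMain (look the setting up, parse it, throw an ArgumentException on an invalid value) can be factored into a small helper. The following is a minimal sketch and not part of SmartDocumentor: the WorkerSettingsReader class and its methods are hypothetical names, and the sketch assumes only that the worker settings behave like a string-to-string dictionary, as the TryGetValue call above suggests.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of the SmartDocumentor API.
public static class WorkerSettingsReader
{
    // Returns the parsed boolean setting, or defaultValue when the setting is absent.
    public static bool GetBool(IDictionary<string, string> settings, string name, bool defaultValue)
    {
        if (!settings.TryGetValue(name, out string raw))
        {
            return defaultValue;
        }

        // bool.TryParse ignores case and surrounding whitespace.
        if (!bool.TryParse(raw, out bool value))
        {
            throw new ArgumentException($"Invalid argument '{name}'. Value: '{raw}'.");
        }

        return value;
    }

    // Returns the parsed integer setting, or defaultValue when the setting is absent.
    public static int GetInt(IDictionary<string, string> settings, string name, int defaultValue)
    {
        if (!settings.TryGetValue(name, out string raw))
        {
            return defaultValue;
        }

        if (!int.TryParse(raw, out int value))
        {
            throw new ArgumentException($"Invalid argument '{name}'. Value: '{raw}'.");
        }

        return value;
    }
}
```

If WorkerSettings implements IDictionary<string, string> (an assumption), the body of InitializeWorkerMain could reduce to a single call such as WorkerSettingsReader.GetBool(WorkerSettings, "ForceExtractedFieldList", false).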
Process Invoice
This worker simplifies configuring whether and how a document is sent to field extraction. It was developed as part of the Generic Plugin workers and is meant to be included in the process station workflow. It exposes common options to consider before sending the document to the API, and it lets each client use its own API keys and settings.
- WebApiUrl -- SmartDocumentor's API URL;
- WebApiKey -- SmartDocumentor's Key;
- WebApiSecret -- SmartDocumentor's Secret;
- RemoveVendorVATMetadataField -- Removes the VendorVAT field if it exists only in the template metadata. A field that exists only in the template metadata has no bounds;
- RemoveCalculatedFields -- Removes all fields that were automatically calculated by SmartDocumentor API;
- UseTaskOCR -- If "false", SmartDocumentor runs the OCR before sending the document to the API, and that OCR is not attached to the task;
- UseOnlyFirstPage -- If "false" and the document has more than one page, SmartDocumentor uses both the first and the last page to detect entities.
Setting Process Invoice in the workspace.config.xml should be similar to this:
<Step From="OCRCompleted" Using="ProcessInvoiceWorker" To="DocumentRecognized" Assembly="SmartDocumentor.GenericPlugin" Namespace="SmartDocumentor.GenericPlugin.Workers">
<SettingList>
<Setting Name="WebApiUrl" Value="" />
<Setting Name="WebApiKey" Value="" />
<Setting Name="WebApiSecret" Value="" />
<Setting Name="RemoveCalculatedFields" Value="False" />
<Setting Name="CustomerFiscalNumber" Value="" />
<Setting Name="UseTaskOCR" Value="False" />
<Setting Name="UseOnlyFirstPage" Value="false" />
</SettingList>
</Step>
Folder Monitor
One of the most common ways to input documents is through a folder. For that, SmartDocumentor includes a folder monitor worker that processes incoming files according to different parameters.
Folder monitor is one of the core components of SmartDocumentor. It is possible to set a post-processing action and, if needed, to split a document into multiple documents or simply to remove blank pages.
Currently, folder monitor supports separation by blank pages, every n pages or barcodes. It is also possible to choose what happens to the input file: it can be deleted, moved to the storage folder or uploaded to another storage.
- Folders -- The folders to monitor.
- FilePatterns -- The file patterns to pick up. The default is "*.tif|*.tiff|*.pdf|*.jpg|*.png".
- SubFoldersLevel -- How many levels of sub-folders to monitor; the root is 0.
- TaskProperties -- Sets key/value pairs on the task. Example: "_BatchName=User1; _DocumentPreProccessedStatus=True".
- SearchTimerInterval -- The search interval in milliseconds. The default is 10000 (10 seconds).
- InputFileDestination -- What to do with the input file once it has been read: Delete, MoveToStorageID or MoveToConnectionString.
- MoveDocStorageConnectionString -- The connection string to use when InputFileDestination is MoveToConnectionString.
- DocSeparationEnabled -- Whether document separation is enabled: "true" or "false".
- DocSeparationMethod -- The separation method. The default is None; the accepted values are BlankPage, EveryNPages and Barcodes.
- DocSeparationEveryNPagesPageSetCount -- Required when DocSeparationMethod is EveryNPages; must be an integer.
- DocSeparationBarcodeType -- The type of barcode, if barcode separation is enabled. Example: Code128.
- DocSeparationBarcodeValueIsRegExpression -- Whether DocSeparationBarcodeValue is a regex: "true" or "false".
- DocSeparationBarcodeValue -- The value of the barcode to detect. It can be a regex.
- DocSeparationBarcodeMinConfidence -- 0 to 100.
- DocSeparationBarcodeTaskProperty -- The name of the property.
- BarcodeExcludePageWithBarcode -- Whether to remove the barcode page: "true" or "false".
- IgnoreBlankPagesEnabled -- Whether blank pages are ignored: "true" or "false".
- IgnoreBlankPagesThreshold -- 0 to 8.
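As an illustration of the EveryNPages method, the sketch below groups page numbers into sets of DocSeparationEveryNPagesPageSetCount pages. It is not the Folder Monitor implementation: the PageSeparationSketch class is a hypothetical name, pages are assumed to be numbered from 1, and a shorter final set is assumed when the page count is not a multiple of the set size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; not the Folder Monitor implementation.
public static class PageSeparationSketch
{
    // Splits pages 1..pageCount into consecutive sets of pagesPerDocument pages.
    // The last set may be shorter (assumption).
    public static List<List<int>> SplitEveryNPages(int pageCount, int pagesPerDocument)
    {
        if (pagesPerDocument <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pagesPerDocument));
        }

        var documents = new List<List<int>>();
        for (int start = 1; start <= pageCount; start += pagesPerDocument)
        {
            int count = Math.Min(pagesPerDocument, pageCount - start + 1);
            documents.Add(Enumerable.Range(start, count).ToList());
        }

        return documents;
    }
}
```

Under these assumptions, a 7-page file with a set size of 3 would yield three documents: pages 1-3, pages 4-6 and page 7.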
Setting Folder Monitor in the workspace.config.xml should be similar to this:
<Step Using="FolderMonitorWorker" To="FileImportedFromFolder">
<SettingList>
<Setting Name="Folders" Value="\\localhost\GenericPluginDemo\DemoInvoice\Input" />
<Setting Name="FilePatterns" Value="*.pdf|*.tif*" />
<Setting Name="TaskProperties" Value="_BatchName=PastaInput" />
<Setting Name="SearchTimerInterval" Value="30000" />
<Setting Name="InputFileDestination" Value="MoveToConnectionString" />
<Setting Name="MoveDocStorageConnectionString" Value="Provider=SDFilesystem;Path=\\localhost\GenericPluginDemo\DemoInvoice\Input_ARCHIVE;CredentialsMode=None;AuthProtocol=NTLM;IsPasswordSecure=True" />
</SettingList>
</Step>
<Step From="FileImportedFromFolder" Using="DocUploadWorker" To="DocFromFolderUploaded">
<RetryPolicyConfig Type="RetryN" NumberOfRetries="5" IntervalBetweenRetries="5" />
<SettingList />
</Step>
<Step From="DocFromFolderUploaded" Using="TaskUploadWorker" To="Workspace:ToProcess">
<RetryPolicyConfig Type="RetryN" NumberOfRetries="5" IntervalBetweenRetries="5" />
<SettingList />
</Step>
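The FilePatterns value in the example above combines several wildcard patterns with '|'. A minimal sketch of how such a value could be matched against file names follows. The FilePatternMatcher class is hypothetical, and case-insensitive matching is an assumption rather than documented Folder Monitor behaviour.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch; not the Folder Monitor implementation.
public static class FilePatternMatcher
{
    // Returns true when fileName matches at least one of the '|'-separated wildcard patterns.
    public static bool Matches(string fileName, string filePatterns)
    {
        return filePatterns
            .Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(pattern => pattern.Trim())
            .Any(pattern => Regex.IsMatch(fileName, WildcardToRegex(pattern), RegexOptions.IgnoreCase));
    }

    // Converts a wildcard pattern ('*' = any characters, '?' = one character) to an anchored regex.
    private static string WildcardToRegex(string pattern)
    {
        return "^" + Regex.Escape(pattern).Replace("\\*", ".*").Replace("\\?", ".") + "$";
    }
}
```

With the pattern "*.pdf|*.tif*" from the example, this sketch accepts both invoice.PDF and scan.tiff and rejects notes.docx.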
Pop3 Monitor
As an alternative to the other input sources, an email account can be configured as the source of documents. This worker keeps the configuration simple: only the host information and the account username and password are needed. The worker tries to get valid attachments from the emails and creates a new task from each resulting file.
This worker requires the Management Station in order to encrypt the password.
This is a good example of how to use the Management Station on an existing workspace configuration.
- Open the Management Station and make sure the configuration path is set;
- In the workspace editor, open the workspace and select ProcessStation from the pipelines;
- On the top bar, select Activities > Data Acquisition > Email (Pop3) Monitor;
- On the Properties tab on the left side, select the three dots on the password field to open the Password Editor and insert the account password. When you click OK, the password is encrypted, and the encrypted value is what gets written to the configuration. Fill in the remaining fields for the email account;
- Finally, connect the Pop3MailMonitor to the FileImportedFromFolder state: drag from one of the squares on the sides of the Pop3MailMonitor to one of the squares on FileImportedFromFolder and a connection is made. This links the worker to the existing configuration, so documents can now arrive from both the folder monitor and email.
- Hostname -- Connection hostname;
- Port -- Connection port;
- UseSSL -- Whether to use SSL;
- Username -- Account username;
- Password -- Account password;
- SearchTimerInterval -- Interval between checks for new email messages, in milliseconds;
- TaskProperties -- List of custom properties to insert on the task, separated by ';' (semicolon);
- MinimumAttachmentHeight -- Minimum height of an attachment file;
- MinimumAttachmentWidth -- Minimum width of an attachment file;
- ValidAttachmentFileExtensions -- Valid attachment file extensions.
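The TaskProperties format described above (key=value pairs separated by semicolons) can be illustrated with a small parser. This is a sketch under assumptions, not the worker's own code: the TaskPropertiesParser class is a hypothetical name, whitespace around keys and values is trimmed, and entries without '=' are skipped.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not part of the SmartDocumentor API.
public static class TaskPropertiesParser
{
    // Parses "Key1=Value1;Key2=Value2" into a dictionary.
    public static Dictionary<string, string> Parse(string taskProperties)
    {
        var properties = new Dictionary<string, string>();
        foreach (string pair in taskProperties.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first '=' only, so values may themselves contain '='.
            int separator = pair.IndexOf('=');
            if (separator < 0)
            {
                continue;
            }

            string key = pair.Substring(0, separator).Trim();
            if (key.Length == 0)
            {
                continue;
            }

            properties[key] = pair.Substring(separator + 1).Trim();
        }

        return properties;
    }
}
```

For the value "CustomProperty=CustomValue;CustomProperty_1=CustomValue_1", the sketch returns two entries, one per custom property.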
Setting Pop3 Monitor in the workspace.config.xml should be similar to this:
<Step Using="Pop3MonitorWorker" To="FileImportedFromFolder">
<SettingList>
<Setting Name="Hostname" Value="Outlook.office365.com" />
<Setting Name="Port" Value="995" />
<Setting Name="UseSSL" Value="True" />
<Setting Name="Password" Value="Password" />
<Setting Name="Username" Value="Username" />
<Setting Name="SearchTimerInterval" Value="90000" />
<Setting Name="TaskProperties" Value="CustomProperty=CustomValue;CustomProperty_1=CustomValue_1" />
<Setting Name="MinimumAttachmentHeight" Value="800" />
<Setting Name="MinimumAttachmentWidth" Value="500" />
<Setting Name="ValidAttachmentFileExtensions" Value=".jpg|.jpeg|.bmp|.tif|.tiff|.png|.pdf" />
</SettingList>
</Step>
Document Publisher
Clients need to store documents once the required information has been processed. This worker is used in many of our pipelines and makes it easy to save documents to a folder or upload them to SharePoint.
- DocumentExistsException -- Queue Id for the document-exists exception;
- StorageConnectionString -- Requires a provider and the data needed to connect to it. The parameters of the connection string are separated by ';' (semicolon). The following providers are currently supported, each with its own parameters:
  - sdsharepoint - ListName; ListFolder; Username; Password; CredentialsMode; AuthTargetUrl; WebUrl; IsPasswordSecure;
  - sdsharepointonline - ListName; ListFolder; Username; Password; CredentialsMode; AuthTargetUrl; WebUrl; IsPasswordSecure;
  - sdfilesystem - Path; CredentialsMode; AuthProtocol; IsPasswordSecure;
- Filename -- Optional; it can contain tags for the task fields;
- OverwriteDestinationFile -- "true" to overwrite the destination file;
- ConvertToSearchablePdf -- Convert to searchable PDF;
- ConvertToPdf -- Convert to PDF;
- SearchablePdfAutoDeskew -- Automatically deskew the image when creating a searchable PDF;
- SearchablePdfPageImageColor -- Set the image color (Bitonal, Grayscale, Color or AutoDetect);
- PathImageZones -- Path image zones;
- DocumentExistsMessage -- Custom message shown when the document already exists;
- TrimFilename -- "true" to trim the filename;
- MetadataWhitelist -- Task properties to be whitelisted.
Setting Document Publisher in the workspace.config.xml should be similar to this:
<Step From="WaitingRequest" Using="DocumentPublisherWorker" To="Integrated">
<RetryPolicyConfig Type="RetryN" NumberOfRetries="0" IntervalBetweenRetries="0" />
<SettingList>
<Setting Name="StorageConnectionString" Value="Provider=sdsharepointonline;ListName=acccount;ListFolder=SupplierInvoices/AwaitingDeliveryNote/{$ Year}/;Username=user@devscope.net;Password=password;CredentialsMode=SpecificUser;AuthTargetUrl=https://devscope.sharepoint.com;AuthProtocol=MSOnlineClaims;WebUrl=https://devscope.sharepoint.com;IsPasswordSecure=False;"/>
<Setting Name="Filename" Value="{$taskelement1}_{$taskelement2}_{$taskelement3}.pdf" />
<Setting Name="Subject" Value="{$Subject}" />
<Setting Name="TrimFilename" Value="true" />
<Setting Name="OverwriteDestinationFile" Value="True" />
<Setting Name="ConvertToSearchablePdf" Value="True" />
<Setting Name="MetadataWhitelist" Value="Year" />
<Setting Name="DocumentExistsException" Value="workspace:Error" />
<Setting Name="DocumentExistsMessage" Value="It was not possible to send to the folder of documents awaiting order. There is a file with the name {Path} {Filename}." />
</SettingList>
</Step>
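The Filename setting above uses tags of the form {$name} that refer to task fields. The sketch below shows one way such tags could be expanded. It is illustrative only: the FilenameTemplate class is hypothetical, optional whitespace after "{$" is tolerated, and leaving unknown tags unchanged is an assumption, not documented Document Publisher behaviour.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; not the Document Publisher implementation.
public static class FilenameTemplate
{
    // Matches tags such as {$Year} and captures the tag name.
    private static readonly Regex Tag = new Regex(@"\{\$\s*(\w+)\s*\}");

    // Replaces each {$name} tag with the matching task element value.
    // Tags without a matching task element are left as they are (assumption).
    public static string Expand(string template, IDictionary<string, string> taskElements)
    {
        return Tag.Replace(
            template,
            match => taskElements.TryGetValue(match.Groups[1].Value, out string value) ? value : match.Value);
    }
}
```

For example, with task elements taskelement1 = "ACME", taskelement2 = "2024" and taskelement3 = "0042" (made-up values), the template from the configuration above expands to ACME_2024_0042.pdf.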
To save to the file system instead, replace the StorageConnectionString setting with something similar to this:
<Setting Name="StorageConnectionString" Value="Provider=sdfilesystem;Path=C:\devscope\SmartDocumentorClients\Client\Output;CredentialsMode=None;AuthProtocol=NTLM;IsPasswordSecure=False" />
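Both connection strings shown above follow the same Provider=...;Key=Value form. The following sketch parses such a string and reports which of the parameters listed for the provider are absent. The class and method names are hypothetical, the documentation does not state that every listed parameter is mandatory, and values containing ';' are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; not part of the SmartDocumentor API.
public static class StorageConnectionStringCheck
{
    private static readonly string[] SharePointParameters =
    {
        "ListName", "ListFolder", "Username", "Password",
        "CredentialsMode", "AuthTargetUrl", "WebUrl", "IsPasswordSecure"
    };

    // Parameters listed for each provider in the StorageConnectionString description.
    private static readonly Dictionary<string, string[]> ListedParameters =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            ["sdsharepoint"] = SharePointParameters,
            ["sdsharepointonline"] = SharePointParameters,
            ["sdfilesystem"] = new[] { "Path", "CredentialsMode", "AuthProtocol", "IsPasswordSecure" }
        };

    // Splits "Key=Value;Key=Value" into a case-insensitive dictionary.
    public static Dictionary<string, string> Parse(string connectionString)
    {
        var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string part in connectionString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int separator = part.IndexOf('=');
            if (separator > 0)
            {
                values[part.Substring(0, separator).Trim()] = part.Substring(separator + 1).Trim();
            }
        }

        return values;
    }

    // Returns the listed parameters that the connection string does not contain.
    public static List<string> MissingParameters(string connectionString)
    {
        Dictionary<string, string> values = Parse(connectionString);
        if (!values.TryGetValue("Provider", out string provider) ||
            !ListedParameters.TryGetValue(provider, out string[] listed))
        {
            throw new ArgumentException("Missing or unknown Provider in connection string.");
        }

        return listed.Where(name => !values.ContainsKey(name)).ToList();
    }
}
```

Run against the file system connection string above, MissingParameters returns an empty list, since Path, CredentialsMode, AuthProtocol and IsPasswordSecure are all present.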
Router Worker
Sometimes the workflow requires that different actions follow the same step. To simplify that, SmartDocumentor implements the router worker. The router does only one thing: it takes the settings whose names start with route and sets the next state based on the values of the task elements.
To help understand it, the Demo Project includes an example of how to build a custom action and then use the router worker to decide the next step.
- route:<any next step> -- the value can use any of the task elements inside {} followed by the expected value.
Demo Example:
protected override void InitializeWorkerMain()
{
    base.InitializeWorkerMain();

    // Custom
    if (WorkerSettings.TryGetValue("MinConfidence", out string minConfidenceString))
    {
        if (!int.TryParse(minConfidenceString, out MinConfidence))
        {
            throw new ArgumentException($"Invalid argument 'MinConfidence'. Value: '{minConfidenceString}'.");
        }
    }

    if (WorkerSettings.TryGetValue("ConfidencePropertyName", out ConfidencePropertyName))
    {
    }

    if (WorkerSettings.TryGetValue("ConfidenceEnabled", out string confidenceEnabledString))
    {
        if (!bool.TryParse(confidenceEnabledString, out ConfidenceEnabled))
        {
            throw new ArgumentException($"Invalid argument 'ConfidenceEnabled'. Value: '{confidenceEnabledString}'.");
        }
    }
}
First we wrote custom code to read MinConfidence, ConfidencePropertyName and ConfidenceEnabled from the workspace configuration. As these values are read inside the ProcessDocumentWorker class, they have to be added as settings under the steps that call this class.
private void CheckConfidence(SDTask item)
{
    if (!ConfidenceEnabled)
    {
        return;
    }

    var confidenceAverage = this.ExtractedFieldList.Average(c => c.Entity?.Confidence ?? 0);
    if (confidenceAverage > MinConfidence)
    {
        item.SetPropertyValue(ConfidencePropertyName, bool.TrueString);
    }
    else
    {
        item.SetPropertyValue(ConfidencePropertyName, bool.FalseString);
    }
}
Then we implemented the confidence check itself and stored its result in a property of the task.
After that we can include this logic in our process station workflow.
For the full code, see our Demo Project.
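The arithmetic of the confidence check can be tried in isolation. The sketch below mirrors CheckConfidence with plain numbers: fields without an entity count as 0, and the average must be strictly greater than the minimum. The ConfidenceCheckSketch class is hypothetical, and representing confidences as double? is an assumption about the field type. It also adds one variation: DefaultIfEmpty guards against an empty list, because Enumerable.Average throws an InvalidOperationException on an empty sequence of non-nullable values.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the confidence arithmetic; not the Demo Project code.
public static class ConfidenceCheckSketch
{
    // Returns true when the average confidence is strictly greater than minConfidence.
    // A null entry stands for a field without an entity and counts as 0.
    public static bool IsConfident(IEnumerable<double?> fieldConfidences, int minConfidence)
    {
        double average = fieldConfidences
            .Select(confidence => confidence ?? 0.0)
            .DefaultIfEmpty(0.0) // variation: avoids the exception Average throws on an empty sequence
            .Average();

        return average > minConfidence;
    }
}
```

With confidences 90, 80 and one missing entity, the average is about 56.7, so a MinConfidence of 50 passes; with 90 and two missing entities the average is 30 and the check fails.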
Setting Router Worker in the workspace.config.xml should be similar to this:
<Step From="Workspace:ConfidenceCheck" Using="RouterWorker" To="Workspace:ToReview">
<SettingList>
<Setting Name="route:Integrate" Value="&quot;{ConfidenceCheck}&quot;==&quot;True&quot;" />
<Setting Name="route:Workspace:ToReview" Value="&quot;{ConfidenceCheck}&quot;==&quot;False&quot;" />
</SettingList>
</Step>
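To make the routing rule concrete, the sketch below evaluates route settings like the ones in the configuration above. It is only a reading of the description, not the RouterWorker implementation: the RouteSketch class is hypothetical, only the "==" comparison between two quoted values is handled, the first matching route wins, and falling back to a default state when no route matches is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; not the RouterWorker implementation.
public static class RouteSketch
{
    private const string Prefix = "route:";

    // Matches {name} placeholders that refer to task elements.
    private static readonly Regex Placeholder = new Regex(@"\{(\w+)\}");

    // Matches expressions of the form "left"=="right".
    private static readonly Regex Comparison =
        new Regex("^\\s*\"([^\"]*)\"\\s*==\\s*\"([^\"]*)\"\\s*$");

    public static string NextState(
        IEnumerable<KeyValuePair<string, string>> settings,
        IDictionary<string, string> taskElements,
        string defaultState)
    {
        foreach (KeyValuePair<string, string> setting in settings)
        {
            if (!setting.Key.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Replace each {name} with the task element value (empty when missing).
            string expression = Placeholder.Replace(
                setting.Value,
                m => taskElements.TryGetValue(m.Groups[1].Value, out string v) ? v : string.Empty);

            Match match = Comparison.Match(expression);
            if (match.Success && match.Groups[1].Value == match.Groups[2].Value)
            {
                // The text after "route:" is the next state.
                return setting.Key.Substring(Prefix.Length);
            }
        }

        return defaultState;
    }
}
```

With the two routes from the configuration above, a task whose ConfidenceCheck element is "True" goes to Integrate, and one whose value is "False" goes to Workspace:ToReview.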
Move Task To History
After the tasks are processed, the last state is commonly Workspace:Final, but it is also common to move the processed documents into a different table, usually TasksHistory. This worker keeps queries on the Tasks database fast and offers different actions to take after the last step: it can move the files to a given storage or reduce the number of fields stored in the task.
- DocFileDestination -- The document destination. Accepts the values NoAction, Delete, MoveToStorageID and MoveToConnectionString. Except for NoAction, every value moves the task from the Tasks table to the TasksHistory table and deletes or moves the file from the original storage;
- MoveDocStorageId -- Storage id. If you selected MoveToStorageID, pass the storage identification here;
- MoveDocStorageConnectionString -- Connection string. If you selected MoveToConnectionString, pass the storage location here; it can be a folder or a SharePoint location. See the Document Publisher worker for the configuration of each provider;
- PropertiesToRemove -- Properties to remove from the task. For example, to make the task lighter for storage, remove the OCR.
Setting Move Task To History Worker in the workspace.config.xml should be similar to this:
<Step From="Workspace:Final" Using="MoveTaskToHistoryWorker" To="Workspace:History">
<RetryPolicyConfig Type="RetryN" NumberOfRetries="3" IntervalBetweenRetries="5" />
<SettingList>
<Setting Name="PropertiesToRemove" Value="_OcrPageModel_\d+|InvoiceData|InvoiceLearningContext" />
<Setting Name="DocFileDestination" Value="MoveToConnectionString" />
<Setting Name="MoveDocStorageConnectionString" Value="Provider=sdfilesystem;Path=E:\Smartdocumentor\SmartDoc\SDDocs_Processed;CredentialsMode=None;AuthProtocol=NTLM;IsPasswordSecure=True" />
</SettingList>
</Step>
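The PropertiesToRemove value in the example above looks like a regular expression with alternatives. The sketch below shows which property names such a pattern would remove. It rests on assumptions: the PropertiesToRemoveSketch class is hypothetical, the value is treated as a regex, and the whole property name is required to match; the worker may match differently.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch; not the MoveTaskToHistoryWorker implementation.
public static class PropertiesToRemoveSketch
{
    // Returns the property names that do NOT match the pattern, i.e. the ones kept on the task.
    public static List<string> Remaining(IEnumerable<string> propertyNames, string propertiesToRemove)
    {
        // Anchor the alternatives so the whole name must match (assumption).
        var pattern = new Regex("^(?:" + propertiesToRemove + ")$");
        return propertyNames.Where(name => !pattern.IsMatch(name)).ToList();
    }
}
```

Under these assumptions, the pattern from the configuration removes _OcrPageModel_1, _OcrPageModel_12, InvoiceData and InvoiceLearningContext, while properties such as InvoiceNumber or _BatchName (made-up names for the example) stay on the task.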
Address: R. de Passos Manuel 223 3°, 4000-385 Porto, Portugal
Email: support@devscope.net
Phone: +315 22 375 1350
Working Days/Hours: Mon-Fri/9:00AM-19:00PM
Copyright © DevScope