Technology Blogs by SAP
Jerome
Advisor
Reading a PDF to extract and structure the data it contains can be tedious work, especially when you have hundreds of files to process, and it is error-prone, with all the consequences those errors can have.

In this blog post, we will see how it is possible to combine Intelligent RPA and Document Information Extraction to automate the processing of your PDF documents.

If you have never heard of Document Information Extraction, here is a short introduction.

Document Information Extraction (also commonly called DOX) is a service you can use to process documents that have content in headers and tables. Typically, you can use it to extract data from invoices, or payment notes. With such a service you can upload a PDF document and get the extracted data as a JSON object.

You can find all kinds of useful information about the service on this page.

But first things first: let's set up the service so we can use it.

Set up the DOX service


You need an SAP CP global account and a CPEA license.

  1. Create a subaccount in your global account using the SAP Cloud Platform Cockpit (you can of course use the same subaccount as the one you use for Intelligent RPA)

  2. Create a Space

  3. Create a Service Instance of DOX (you might need to add a Service Plan for this service to your subaccount first)

  4. Create a Service Key (note that you can find more information about this procedure here)

  5. Once the Service Key is created, you should have a JSON object with the following structure:


{
  "url": "",
  "uaa": {
    "uaadomain": "",
    "tenantmode": "",
    "sburl": "",
    "clientid": "",
    "verificationkey": "",
    "apiurl": "",
    "xsappname": "",
    "identityzone": "",
    "identityzoneid": "",
    "clientsecret": "",
    "tenantid": "",
    "url": ""
  }
}

Let's save the url, uaa.clientid, uaa.clientsecret and uaa.url as we will need them later.
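As an illustration only, the four values can be picked out of the Service Key with a few lines of plain JavaScript. This is a sketch, not code generated by the tool: pickDoxSettings and the property names of the returned object are hypothetical, and only the Service Key field names come from the structure shown above.

```javascript
// Sketch: pick the four values the bot needs out of a Service Key object.
// pickDoxSettings is a hypothetical helper; the input field names follow
// the Service Key structure shown above.
function pickDoxSettings(serviceKey) {
  const uaa = serviceKey.uaa || {};
  const settings = {
    doxURL: serviceKey.url,          // url
    tokenURL: uaa.url,               // uaa.url
    clientId: uaa.clientid,          // uaa.clientid
    clientSecret: uaa.clientsecret   // uaa.clientsecret
  };
  // Fail early if the key is incomplete, rather than at the first web service call.
  for (const [name, value] of Object.entries(settings)) {
    if (typeof value !== 'string' || value === '') {
      throw new Error('Service key is missing a value for ' + name);
    }
  }
  return settings;
}

// Usage with placeholder values (not a real key):
const exampleSettings = pickDoxSettings({
  url: 'https://dox.example',
  uaa: { url: 'https://auth.example', clientid: 'my-client', clientsecret: 'my-secret' }
});
console.log(exampleSettings.tokenURL); // https://auth.example
```

Checking for empty values up front is a design choice of this sketch: a Service Key copied incompletely then fails with a clear message instead of an authentication error later on.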

Create a bot


Create a new project and a new workflow.

Set up the variables



  1. Create input/output parameters in the context of the scenario

  2. Insert the Set Context activities as shown below

  3. For each one of them, set the value in the dedicated variable (see below):

    • Set the credentials (see below for the properties).
      Don't forget to replace uaa.clientid and uaa.clientsecret with the values from the Service Key you created earlier.

    • Set the tokenURL variable: don't forget to replace uaa.url with the value from the Service Key.

    • Set the doxURL variable: replace the url value with the one from the Service Key. The last part of the URL was added as required by the documentation.

    • Last, set the path of a PDF file. Ex:




Web services to use DOX



  1. Insert a web service activity, whose properties are shown below:

  2. Insert another web service activity. This one will be used to send the PDF document to the DOX service. The properties of this activity are shown below:

  3. Finally, insert a last web service activity, and provide the following properties:


 

You should have the workflow below:


Now, build the solution. We'll have to make some adjustments in the JavaScript code which is generated.

Note: in this case, the values of clientid and clientsecret are hard-coded in the property value of the Set Context activity. For better security, you can store these values in a Factory credentials variable and retrieve them in the workflow, as explained in this blog.

Update the code


Since we cannot provide all the required options in the web service activities, we need to add them manually in the generated code.

  1. Update the web service call that generates the token so that it reads as shown below:
    // ----------------------------------------------------------------
    // Step: Generate_token
    // ----------------------------------------------------------------
    GLOBAL.step({ Generate_token: function(ev, sc, st) {
      var rootData = sc.data;
      ctx.workflow('ExtractDataFromPDF', '7e436b4c-226a-4a50-93c3-4948023e62db') ;
      // Generate token
      ctx.ajax.call({
        url: rootData.WS.Input.ServiceKey.tokenURL + '/oauth/token?grant_type=client_credentials',
        method: e.ajax.method.get,
        contentType: e.ajax.content.json,
        header: {
          Accept: e.ajax.content.json,
          Authorization: rootData.WS.Input.ServiceKey.client
        },
        success: function(res, status, xhr) {
          sc.localData.token = 'Bearer ' + ctx.get(res, 'access_token');
          sc.endStep(); // Upload_file
          return;
        },
        error: function(xhr, error, statusText) {
          ctx.log(' ctx.ajax.call error: ' + statusText);
        }
      });
    }});

    The data attribute has been removed and the header has been inserted. Also, in the success callback, we extract only the access token from the response and store it, prefixed with 'Bearer ', in the sc.localData.token variable.

  2. When we upload the PDF file, we should use the following code instead of the automatically generated one:
    // ----------------------------------------------------------------
    // Step: Upload_file
    // ----------------------------------------------------------------
    GLOBAL.step({ Upload_file: function(ev, sc, st) {
      var rootData = sc.data;
      ctx.workflow('ExtractDataFromPDF', '0a1f4176-8907-4544-a807-65095d30ea36') ;
      // Upload file
      ctx.ajax.call({
        url: rootData.WS.Input.ServiceKey.doxURL + '/document/jobs',
        method: e.ajax.method.post,
        formData: [{
          file: rootData.WS.Input.filePath,
          type: e.ajax.content.pdf,
          name: 'file'
        }, {
          value: ctx.json.stringify({"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}),
          type: e.ajax.content.jsonText,
          name: 'options'
        }],
        header: {
          Accept: e.ajax.content.json,
          Authorization: sc.localData.token
        },
        contentType: e.ajax.content.json,
        success: function(res, status, xhr) {
          rootData.WS.Output.docId = ctx.get(res, 'id');
          sc.endStep(); // Retrieve_extracted_da
          return;
        },
        error: function(xhr, error, statusText) {
          ctx.log(' ctx.ajax.call error: ' + statusText);
        }
      });
    }});

    We provide the inputs as formData, composed of the path of the PDF file and the options expected by the DOX service. We also need to provide the Authorization header so that the service accepts the request. Once the document is uploaded, the service generates a unique id. We will use this id to retrieve the data extracted from the document.

  3. Now comes the fun part: the service might take a while to process the document and send back the data. While it is processing the document, it returns PENDING as the result status. When processing is complete, we get the status DONE. As we do not want to block the bot while it is waiting for the result from DOX, let's add a loop so that it periodically asks for the results.

    • First, insert a new link between the last step and itself in the definition of the scenario:
      // ----------------------------------------------------------------
      // Scenario: ExtractDataFromPDF
      // ----------------------------------------------------------------
      GLOBAL.scenario({ ExtractDataFromPDF: function(ev, sc) {
        var rootData = sc.data;

        sc.setMode(e.scenario.mode.clearIfRunning);
        sc.setScenarioTimeout(600000); // Default timeout for global scenario.
        sc.onError(function(sc, st, ex) { sc.endScenario(); }); // Default error handler.
        sc.onTimeout(30000, function(sc, st) { sc.endScenario(); }); // Default timeout handler for each step.
        sc.step(GLOBAL.steps.Set_credentials, GLOBAL.steps.Set_tokenURL);
        sc.step(GLOBAL.steps.Set_tokenURL, GLOBAL.steps.Set_doxURL);
        sc.step(GLOBAL.steps.Set_doxURL, GLOBAL.steps.Set_file_path);
        sc.step(GLOBAL.steps.Set_file_path, GLOBAL.steps.Generate_token);
        sc.step(GLOBAL.steps.Generate_token, GLOBAL.steps.Upload_file);
        sc.step(GLOBAL.steps.Upload_file, GLOBAL.steps.Retrieve_extracted_da);
        sc.step(GLOBAL.steps.Retrieve_extracted_da, GLOBAL.steps.Retrieve_extracted_da, 'loop');
        sc.step(GLOBAL.steps.Retrieve_extracted_da, null);
      }}, ctx.dataManagers.rootData).setId('57146a42-30d7-47af-8aad-9844f008f7d8') ;

      Using the name 'loop' we will be able to make the bot go to the step Retrieve_extracted_da (which is, in our case, the same step it was in previously)

    • Then, update the code of the last web service call, so you get:
      // ----------------------------------------------------------------
      // Step: Retrieve_extracted_da
      // ----------------------------------------------------------------
      GLOBAL.step({ Retrieve_extracted_da: function(ev, sc, st) {
        var rootData = sc.data;
        ctx.workflow('ExtractDataFromPDF', '6ae7f1e0-85e1-410d-9ce6-f9509d1d5242') ;
        // Retrieve extracted data
        ctx.ajax.call({
          url: rootData.WS.Input.ServiceKey.doxURL + '/document/jobs/' + rootData.WS.Output.docId + '?clientId=c_00&timestamp=' + new Date().getTime(),
          method: e.ajax.method.get,
          header: {
            Accept: e.ajax.content.json,
            Authorization: sc.localData.token
          },
          contentType: e.ajax.content.json,
          success: function(res, status, xhr) {
            if (res.status == 'DONE') {
              rootData.WS.Output.data = res;
              sc.endStep(); // end Scenario
              return;
            } else if (res.status == 'PENDING') {
              ctx.wait(function() {
                sc.endStep('loop');
              }, 5000);
            }
          },
          error: function(xhr, error, statusText) {
            ctx.log(' ctx.ajax.call error: ' + statusText);
          }
        });
      }});

      As described above, when the status is PENDING, we will loop over this step after a short delay (here, 5000 ms). This can be achieved with the sc.endStep('loop') instruction.
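The three web service calls above can also be tried outside the bot, for instance to check a Service Key quickly. The following is a minimal sketch in plain JavaScript, assuming Node 18 or later (global fetch, FormData and Blob); it is not SDK code. The function names, the cfg object and the maxAttempts cap are made up for illustration, and the Basic header assumes that the credentials variable holds a value built from uaa.clientid and uaa.clientsecret. The endpoints and the PENDING/DONE handling mirror the generated code.

```javascript
// Sketch only: token, upload and polling in plain JavaScript (Node 18+ assumed).
// All function names and the cfg object are hypothetical.

// Assumption: the credentials value is an HTTP Basic header built from the Service Key.
function basicAuth(clientId, clientSecret) {
  return 'Basic ' + Buffer.from(clientId + ':' + clientSecret).toString('base64');
}

// Step 1: request a token and keep only the part the later calls need.
async function getToken(cfg, fetchImpl = fetch) {
  const res = await fetchImpl(cfg.tokenURL + '/oauth/token?grant_type=client_credentials', {
    method: 'GET',
    headers: {
      Accept: 'application/json',
      Authorization: basicAuth(cfg.clientId, cfg.clientSecret)
    }
  });
  const body = await res.json();
  return 'Bearer ' + body.access_token;
}

// Step 2: send the PDF bytes and the options as multipart form data; returns the job id.
async function uploadPdf(cfg, token, pdfBytes, options, fetchImpl = fetch) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), 'document.pdf');
  form.append('options', JSON.stringify(options));
  const res = await fetchImpl(cfg.doxURL + '/document/jobs', {
    method: 'POST',
    headers: { Accept: 'application/json', Authorization: token },
    body: form
  });
  const body = await res.json();
  return body.id;
}

// Step 3: ask for the result until the status is DONE, waiting between attempts.
async function waitForResult(cfg, token, docId, clientId, poll = {}, fetchImpl = fetch) {
  const delayMs = poll.delayMs ?? 5000;        // same delay as in the bot
  const maxAttempts = poll.maxAttempts ?? 60;  // hypothetical safety cap
  const url = cfg.doxURL + '/document/jobs/' + encodeURIComponent(docId) +
    '?clientId=' + encodeURIComponent(clientId);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchImpl(url, {
      method: 'GET',
      headers: { Accept: 'application/json', Authorization: token }
    });
    const body = await res.json();
    if (body.status === 'DONE') return body;
    if (body.status !== 'PENDING') {
      throw new Error('Unexpected job status: ' + body.status);
    }
    await new Promise(function (resolve) { setTimeout(resolve, delayMs); });
  }
  throw new Error('Job ' + docId + ' still PENDING after ' + maxAttempts + ' attempts');
}
```

Unlike the bot, this sketch stops with an error when the status is neither DONE nor PENDING and when the attempt cap is reached, so a job that never completes does not keep the loop running forever. The fetchImpl parameter exists only so that the functions can be exercised with a stub instead of the real service.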




Important note: if we look at the code, we can see that we pass a clientId parameter in the URL when we retrieve the data. This parameter is also present in the options we send to DOX when we upload the document. In our case, it has the value c_00. According to the API documentation of the service, this parameter is mandatory. To make sure that you can use the service with this client id, you need to create a client using the Client API (details here).
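Because the client id appears in two places, it is easy to change one and forget the other. One way to avoid that, sketched below in plain JavaScript, is to derive both the upload options and the retrieval URL from a single value. buildUploadOptions and buildJobUrl are hypothetical helper names, the host is a placeholder, and the options shown are only a subset of those used in the upload step.

```javascript
// Sketch: derive both occurrences of the client id from one value.
// buildUploadOptions and buildJobUrl are hypothetical helper names.
function buildUploadOptions(clientId, headerFields, lineItemFields) {
  return {
    extraction: { headerFields: headerFields, lineItemFields: lineItemFields },
    clientId: clientId,
    documentType: 'invoice'
  };
}

function buildJobUrl(doxURL, docId, clientId) {
  // The timestamp only makes each polling URL unique, as in the bot's code.
  return doxURL + '/document/jobs/' + encodeURIComponent(docId) +
    '?clientId=' + encodeURIComponent(clientId) +
    '&timestamp=' + Date.now();
}

// The client must already exist; it is created through the Client API.
const clientId = 'c_00';
const options = buildUploadOptions(clientId, ['documentNumber', 'grossAmount'], ['description']);
const jobUrl = buildJobUrl('https://dox.example', 'abc123', clientId);
```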

Conclusion


At this point, the data extracted from the document is stored in the rootData.WS.Output.data variable, so you can pass it as a parameter to another scenario, for example to process invoices.

As detailed in the documentation, the result is a JSON object. Each extracted value comes with a confidence score (a number between 0 and 1), which you can use in the scenario when working with the data.

One might imagine a scenario where an error is raised if a value has a confidence score lower than 0.8, for example.
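Such a check could look like the sketch below. It assumes that each extracted field is an object with name, value and confidence properties; this shape is an assumption made for illustration, so check the actual structure of the result in the service documentation. findLowConfidence and assertConfident are hypothetical helper names.

```javascript
// Sketch: reject a result when any field falls below a confidence threshold.
// Assumption: each field looks like { name, value, confidence }.
function findLowConfidence(fields, threshold) {
  return fields.filter(function (field) {
    return typeof field.confidence === 'number' && field.confidence < threshold;
  });
}

function assertConfident(fields, threshold) {
  const weak = findLowConfidence(fields, threshold);
  if (weak.length > 0) {
    const names = weak.map(function (f) { return f.name + ' (' + f.confidence + ')'; });
    throw new Error('Low confidence for: ' + names.join(', '));
  }
}

// Usage with made-up values:
const sampleFields = [
  { name: 'documentNumber', value: 'INV-001', confidence: 0.97 },
  { name: 'grossAmount', value: 120.5, confidence: 0.62 }
];
const weakFields = findLowConfidence(sampleFields, 0.8);
console.log(weakFields.map(function (f) { return f.name; })); // [ 'grossAmount' ]
```

In a bot, the error raised by assertConfident could route the document to a manual review step instead of stopping the scenario; that choice depends on the process being automated.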

 

And more...


This content is also available as a webinar; see this page to find it in the list along with other webinar offerings.

And you can even download the sample directly from the Store in the Factory!


Last, you can watch this video presenting an end-to-end use case.

What's next?


Now it's up to you to build powerful bots combining Intelligent RPA and other services. You know what to do!