Extract whole text from Document Information Extraction

mtavvale
Discoverer

Hello.

We are currently using Document Information Extraction to read scanned PDF files with the SAP_OCROnly_schema document schema.

The OCR works great, and we retrieve the content using the Get All Pages Text API or the .get_document_text method in Python. However, the result is JSON containing bounding boxes and their corresponding text.

Is there a way to get the whole text from the document (or page by page) instead of for each individual bounding box?

Thanks,

Manuel T.

tobias_weller
Advisor

Hi Manuel, 
We currently support only the JSON output. The easiest solution is to write a small script that converts the JSON output to plain text.
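A minimal sketch of such a script, for illustration only. It assumes the response is a dict of the form {"results": {"<page number>": [{"word_boxes": [{"content": "...", "bbox": ...}, ...]}, ...]}}, with one list entry per text line. These key names are an assumption, so check them against the JSON you actually receive. The function names pages_to_text and document_to_text are hypothetical helpers, not part of any SAP library.

```python
from __future__ import annotations

from typing import Any


def pages_to_text(response: dict[str, Any]) -> dict[int, str]:
    """Collapse bounding-box output into one plain-text string per page.

    ASSUMPTION: `response["results"]` maps page numbers (as strings) to a
    list of lines, and each line carries a "word_boxes" list whose items
    hold the recognized text under "content". Adjust the keys if your
    JSON differs.
    """
    pages = response.get("results", {})
    text_by_page: dict[int, str] = {}
    # Page keys are strings, so sort them numerically ("10" after "2").
    for page_no in sorted(pages, key=int):
        lines = []
        for line in pages[page_no]:
            words = [box.get("content", "") for box in line.get("word_boxes", [])]
            joined = " ".join(word for word in words if word)
            if joined:
                lines.append(joined)
        # Lines are kept in the order they appear in the JSON.
        text_by_page[int(page_no)] = "\n".join(lines)
    return text_by_page


def document_to_text(response: dict[str, Any], page_separator: str = "\n\n") -> str:
    """Join all pages into a single string, in page order."""
    return page_separator.join(pages_to_text(response).values())
```

Passing the parsed JSON to pages_to_text gives the text page by page, and document_to_text gives the whole document as one string. The sketch does not use the bounding-box coordinates, so multi-column layouts may need an extra sort on the box positions.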
You can also use the influence program to raise a feature request for the product: https://influence.sap.com/sap/ino/#campaign/3667

Best regards,
Tobias

mtavvale
Discoverer

Thanks Tobias.

I have raised the request. I think it would be an improvement, especially for generative AI processing.

 https://influence.sap.com/sap/ino/#/idea/325306