With something as powerful as HANA Text Analysis (TA), there is a growing need to pre-process and post-process various elements of a given document.
The first part, on using such post-processing to identify the document type, is covered in my previous blog: HANA Text Analysis Married to Structured Statistical Models, It's a Brochure!
The scenario under discussion here is extracting a table from a document, say a PDF, and making sense of it for semantic search.
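To make things clearer, assume we have a table like:
Now, with HANA TA, or with natural language processing in general, we cannot establish the relationship that Model GX 5 has a Power Consumption[kW] of 7.33, because that relationship is carried by the table layout rather than by any sentence in the text.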
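To illustrate what "setting the relationship" means once the table cells are known, here is a minimal Python sketch. It assumes the table has already been recovered as a list of rows with the header first, and that "Model" is the first column with "GX 5" as its value; the function name `table_to_facts` is hypothetical and not part of HANA TA or of the referenced blog.

```python
def table_to_facts(rows):
    """Turn a table (header row first) into (row label, column label, value) facts."""
    header, *body = rows
    facts = []
    for row in body:
        # Assumption: the first column identifies the row, e.g. "Model" + "GX 5".
        row_label = f"{header[0]} {row[0]}"
        for column_label, cell in zip(header[1:], row[1:]):
            facts.append((row_label, column_label, cell))
    return facts


# Only the single cell mentioned in the text; a real table would have more rows.
rows = [
    ["Model", "Power Consumption[kW]"],
    ["GX 5", "7.33"],
]
facts = table_to_facts(rows)
for fact in facts:
    print(fact)
```

Each fact ties a value to both its row and its column header, which is exactly the link that plain text analysis loses.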
I got a very good starting point from this blog: http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
I had to adjust some bits for 64-bit Windows and rework the cell recognition code, but after that it was almost ready to use.
The steps I followed were:
The result below is the outcome of running the Python script, with the 4 steps above, on the PDF whose table is shown above.
As we can see, some lines are missed. This is because of noise in the image: for example, the 3rd horizontal line has a lot of pixel noise and is not black enough to be detected.
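The effect of a line being "not black enough" can be sketched in a few lines of Python. This is not the cell recognition code from the referenced blog; it is a simplified illustration in which the image is assumed to be a plain list of grayscale rows (0 = black, 255 = white), and the function name and both thresholds are hypothetical.

```python
def find_horizontal_lines(pixels, dark_below=128, min_dark_fraction=0.9):
    """Return indices of pixel rows that look like horizontal ruling lines.

    A row counts as a line when at least `min_dark_fraction` of its pixels
    are darker than `dark_below`.
    """
    lines = []
    for y, row in enumerate(pixels):
        dark = sum(1 for value in row if value < dark_below)
        if dark / len(row) >= min_dark_fraction:
            lines.append(y)
    return lines


# Hypothetical 6x10 image: row 1 is a clean line, row 4 is a faint, noisy one.
WHITE, BLACK, GREY = 255, 0, 150
image = [[WHITE] * 10 for _ in range(6)]
image[1] = [BLACK] * 10
image[4] = [GREY, BLACK, GREY, GREY, BLACK, GREY, GREY, BLACK, GREY, GREY]

strict = find_horizontal_lines(image)                   # misses the faint line
lenient = find_horizontal_lines(image, dark_below=200)  # picks it up
print(strict, lenient)
```

Loosening the darkness threshold recovers the faint line here, but on a real scan it can also turn background noise into false lines, so the threshold needs tuning per document source.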
Once we have this information, we can store it in the graph against the document and page, and use it for search. Here we used Tesseract, which is open source and needs a clean, high-quality image; however, one could opt for a good licensed OCR engine or a licensed PDF reader and experiment with it.
With this approach, all metadata influences, such as font size, tables, italics and bold text, can now be considered to enhance the build-up of the semantic net, as supplementary pre/post-processing on top of HANA TA.
I hope this blog helps you. I look forward to your feedback and to hearing about the use cases on your wish list that this could help achieve.
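As a rough idea of what storing a table cell "against the document, page" could look like, here is a small in-memory sketch in Python. The `TableGraph` class is a hypothetical stand-in, not a HANA graph API, and the file name `brochure.pdf` and page number are made up for illustration.

```python
from collections import defaultdict


class TableGraph:
    """Tiny in-memory stand-in for a graph store (not a HANA API)."""

    def __init__(self):
        # node -> set of (relation, target node) edges
        self.edges = defaultdict(set)

    def add_edge(self, source, relation, target):
        self.edges[source].add((relation, target))

    def add_cell(self, document, page, row_label, column_label, value):
        """Link document -> page -> row label, and row label -> value."""
        page_node = f"{document}#page{page}"
        self.add_edge(document, "has_page", page_node)
        self.add_edge(page_node, "mentions", row_label)
        self.add_edge(row_label, column_label, value)

    def lookup(self, row_label, column_label):
        """Values attached to a row label under a given column header."""
        found = self.edges.get(row_label, set())
        return sorted(target for relation, target in found if relation == column_label)

    def pages_mentioning(self, row_label):
        """Page nodes whose tables mention the given row label."""
        return sorted(
            source
            for source, outgoing in self.edges.items()
            if ("mentions", row_label) in outgoing
        )


graph = TableGraph()
# Hypothetical document name and page number; the cell is the one from the text.
graph.add_cell("brochure.pdf", 1, "Model GX 5", "Power Consumption[kW]", "7.33")

print(graph.lookup("Model GX 5", "Power Consumption[kW]"))
print(graph.pages_mentioning("Model GX 5"))
```

A search for "Model GX 5" can then return both the value and the page it came from, which is the provenance a semantic search result needs.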
Cheers,
Jemin