I followed steps similar to those abinaya.seenivasan explained in her blog about OCR, and I want to share a few steps that I felt were missing there, along with my understanding and learnings.


1. Download Tesseract OCR for Windows from the link below


2. Update the training data from the Tesseract Best data
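
After copying the training data, it can help to confirm that the file for each language is really in place before running OCR. Below is a minimal sketch of such a check. The helper name missing_traineddata and the tessdata folder path in the usage note are my own assumptions for illustration and are not part of Tesseract or pytesseract; the check relies only on Tesseract's convention that each language is stored as <lang>.traineddata and that several languages can be combined with '+'.

```python
import os

def missing_traineddata(tessdata_dir, lang):
    """Return the language codes from `lang` that have no .traineddata file.

    `lang` uses the same format as the Tesseract language option,
    for example 'tha' or 'tha+eng'. This helper is hypothetical and
    only looks at file names; it does not validate the file contents.
    """
    missing = []
    for code in lang.split('+'):
        data_file = os.path.join(tessdata_dir, code + '.traineddata')
        if not os.path.isfile(data_file):
            missing.append(code)
    return missing
```

For example, missing_traineddata(r'C:\Program Files\Tesseract-OCR\tessdata', 'tha') should return an empty list once the Thai data has been copied, assuming Tesseract was installed in that folder.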


3. Simple Python script required to execute OCR
import pytesseract
import json
import os

# Folder that contains all the files to process
filePath = 'D:\\IRPA_TEMP\\'

# Set the location where Tesseract is installed
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Consider all files in the folder
files = os.listdir(filePath)

# The files are assumed to be in image format already
for file in files:
    # Skip JSON output written by an earlier run of this script
    if file.lower().endswith('.json'):
        continue

    # Convert the image to a string for a specific language; the default language is English
    outText = pytesseract.image_to_string(os.path.join(filePath, file), lang='tha', config='--psm 11')

    # Split the text into a list of strings, one per line
    outTextList = outText.split('\n')

    # Extract data from the list above and prepare the JSON content
    # (example only: the second line is taken as the vendor, if it exists)
    jsondata = {"Vendor": outTextList[1] if len(outTextList) > 1 else ""}

    # Save the extracted data to a JSON file named after the image
    fileName = os.path.splitext(file)[0]
    with open(os.path.join(filePath, fileName + '.json'), 'w', encoding='utf-8') as json_file:
        json.dump(jsondata, json_file, ensure_ascii=False)
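
In my experience the raw OCR text usually contains blank lines and stray spaces, so picking a value by a fixed index can be fragile. Here is a minimal sketch of cleaning the text before building the JSON content. The helper names clean_lines and build_record, the rule "first non-empty line is the vendor", and the sample text are all assumptions for illustration; the real mapping depends on the layout of your document.

```python
import json

def clean_lines(out_text):
    """Split OCR output into lines, dropping blank lines and outer whitespace."""
    return [line.strip() for line in out_text.splitlines() if line.strip()]

def build_record(lines):
    """Hypothetical mapping: treat the first non-empty line as the vendor."""
    return {"Vendor": lines[0] if lines else ""}

# Made-up sample text standing in for the result of image_to_string
sample = "\n  ACME Trading Co. \n\nInvoice 1001\n\n"
record = build_record(clean_lines(sample))
print(json.dumps(record, ensure_ascii=False))
```

Keeping the cleaning and the mapping in separate functions makes it easier to adjust the mapping per document type without touching the OCR call.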


4. Config values generally used in Pytesseract
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
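
The mode number from the list above goes into the config string as '--psm N', as in the script in step 3. A small sketch of building that string with a range check follows; the helper name psm_config is my own and is not part of pytesseract.

```python
# Valid page segmentation modes, taken from the list above (0 to 13)
PSM_MIN, PSM_MAX = 0, 13

def psm_config(mode):
    """Build the config string for a page segmentation mode.

    Hypothetical helper: raises ValueError for a mode outside 0..13
    so that a typo is caught before Tesseract is called.
    """
    if not isinstance(mode, int) or not PSM_MIN <= mode <= PSM_MAX:
        raise ValueError('psm must be an integer from 0 to 13, got %r' % (mode,))
    return '--psm ' + str(mode)
```

The result can be passed directly as the config argument, for example pytesseract.image_to_string(image_path, lang='tha', config=psm_config(11)), where image_path is the path to your image.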
