llkanetworks.blogg.se - Excel import pdf table

This answer is for anyone encountering PDFs with images and needing to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

The steps that worked for me:

1. Use pdfimages to turn the pages of the PDF into images.
2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
3. Use OpenCV to find and extract each cell from the table.
4. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
5. Combine the extracted text of each cell into the format you need.

I wrote a Python package with modules that can help with those steps. Some of the steps don't require code; they take advantage of external tools like pdfimages and tesseract. I'll provide some brief examples for a couple of the steps that do require code.

This link was a good reference while figuring out how to find tables.
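As a rough illustration of the last step (combining the text of each cell into the format you need), here is a minimal sketch that is not taken from the package mentioned above. It assumes each cell has already been located and OCR'd, so that you have a bounding box and a text string per cell, and it assumes CSV is the target format. The names `Cell`, `group_into_rows`, and `cells_to_csv` are hypothetical and exist only for this example.

```python
import csv
import io
from typing import List, NamedTuple


class Cell(NamedTuple):
    """One OCR'd table cell: bounding box in pixels plus recognized text."""
    x: int
    y: int
    w: int
    h: int
    text: str


def group_into_rows(cells: List[Cell]) -> List[List[Cell]]:
    """Group cells into rows by vertical position, left to right within a row.

    A cell joins the current row when its vertical center falls inside the
    vertical extent of the first cell placed in that row. This tolerates the
    few pixels of jitter that detected boxes usually have.
    """
    rows: List[List[Cell]] = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        center_y = cell.y + cell.h / 2
        if rows:
            anchor = rows[-1][0]
            if anchor.y <= center_y <= anchor.y + anchor.h:
                rows[-1].append(cell)
                continue
        rows.append([cell])
    return [sorted(row, key=lambda c: c.x) for row in rows]


def cells_to_csv(cells: List[Cell]) -> str:
    """Serialize the cells as CSV text, one table row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in group_into_rows(cells):
        # Collapse stray whitespace and line breaks that OCR tends to leave.
        writer.writerow([" ".join(c.text.split()) for c in row])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        Cell(210, 12, 100, 30, "Qty"),
        Cell(10, 10, 190, 30, "Item "),
        Cell(10, 45, 190, 30, "Widget,\nlarge"),
        Cell(210, 43, 100, 30, " 3"),
    ]
    print(cells_to_csv(sample), end="")
```

Grouping by vertical center rather than by exact `y` is a design choice for this sketch: boxes in the same row rarely share an identical top edge. Tables with merged or very tall cells would need a more careful grouping rule.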