Tesseract, an OCR (Optical Character Recognition) tool for Image Reading from Excel

Here is a detailed, step-by-step guide on how to extract text from images in an Excel sheet using Tesseract, an OCR (Optical Character Recognition) tool, along with Python. The process assumes no prior knowledge, so I’ll walk through every step, from installing Tesseract to writing a Python script that extracts text from images embedded in the Excel sheet.

Step 1: Install Required Tools and Libraries

1.1 Install Tesseract OCR

Tesseract is an open-source OCR engine that can extract text from images. First, you need to install it on your system.

  • Windows:
    • Download the Tesseract installer for Windows.
    • Run the installer and follow the prompts.
    • During installation, make sure to check the box that adds Tesseract to your system’s PATH. This makes it easier to use in Python.
  • Linux (Debian/Ubuntu): You can install Tesseract via the terminal:

    bash

    sudo apt-get update
    sudo apt-get install tesseract-ocr

  • Mac: Use Homebrew to install Tesseract:

    bash

    brew install tesseract
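
Before moving on, it can help to confirm that the installation worked. The snippet below is a minimal sketch that uses only the Python standard library; the function name tesseract_version is my own, and it assumes the executable is called tesseract and is reachable through PATH.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract was not found on PATH")
```

If this prints the "not found" message, revisit the installation step for your operating system before continuing.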

1.2 Install Python Libraries

You’ll need some Python libraries for reading the Excel file, processing the images, and extracting text. You can install all of them at once using pip:

bash

pip install openpyxl Pillow pytesseract

  • openpyxl: For reading Excel files and extracting embedded images.
  • Pillow: For image processing (this is a Python imaging library).
  • pytesseract: A Python wrapper for Tesseract to perform OCR on the images.
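
To check that the three packages landed in the interpreter you are actually using, a quick sketch like the following can help. The helper name missing_packages is hypothetical; note that Pillow is installed under the name Pillow but imported as PIL.

```python
import importlib.util

# Map each pip package name to the module name it is imported under.
REQUIRED = {"openpyxl": "openpyxl", "Pillow": "PIL", "pytesseract": "pytesseract"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
print("All packages found" if not missing else f"Missing: {', '.join(missing)}")
```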

Step 2: Set Up Tesseract in Python

Once you have Tesseract installed, you need to point Python to the Tesseract executable so it can use it for OCR. This can be done by specifying the path where Tesseract is installed.

  • Windows: If Tesseract was added to PATH during installation, Python should automatically find it. If not, you need to provide the path:

    python

    import pytesseract

    # Specify the path to the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

  • Linux/Mac: Tesseract is usually available globally after installation, so you don’t need to specify a path unless it’s installed in a custom location.
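
If the same script has to run on several machines, one option is to look on PATH first and fall back to a known install location only when that fails. This is a rough sketch, not the only way to do it; resolve_tesseract_cmd is a name I made up, and the fallback path is simply the Windows default shown above.

```python
import os
import shutil

# Default location used in this guide for a Windows installation.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallbacks=(WINDOWS_DEFAULT,)):
    """Prefer the executable on PATH; otherwise try the given install locations."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


cmd = resolve_tesseract_cmd()
if cmd is None:
    print("Tesseract not found; install it or set tesseract_cmd manually")
else:
    print(f"Using Tesseract at {cmd}")
    # With pytesseract imported, the result would be applied like this:
    # pytesseract.pytesseract.tesseract_cmd = cmd
```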

Step 3: Extract Images from Excel Sheet

Now that we have Tesseract set up, the next step is to extract the images embedded in your Excel sheet. We’ll use the openpyxl library for this.

Here’s how you can extract images from an Excel file:

  1. Open the Excel file using openpyxl.
  2. Loop through the worksheet to find and save embedded images.

python

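from io import BytesIO

from openpyxl import load_workbook
from PIL import Image as PILImage

# Load the Excel workbook
wb = load_workbook('your_excel_file.xlsx')

# Select the active sheet (you can change this if necessary)
sheet = wb.active

# Iterate over the images in the sheet.
# Note: _images and _data() are private openpyxl members and may change
# between versions; in openpyxl 3.x the image object has no .image attribute.
for idx, img in enumerate(sheet._images):
    # Read the raw image bytes and save them locally as PNG
    img_name = f'image_{idx}.png'
    PILImage.open(BytesIO(img._data())).save(img_name)
    print(f"Image saved as {img_name}")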

This code saves all the embedded images from the Excel file as PNG files in your working directory.
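
The openpyxl approach relies on the private sheet._images attribute. As an alternative sketch (a different technique, not what the script in this guide does), you can treat the .xlsx file as what it is, a ZIP archive, and copy out everything stored under xl/media/. The function name extract_media is hypothetical. Keep in mind that this pulls images from every sheet in the workbook and keeps their original file formats rather than converting them to PNG.

```python
import zipfile
from pathlib import Path


def extract_media(xlsx_path, out_dir="extracted_images"):
    """Copy every file under xl/media/ in the workbook archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only files in the media folder.
            if name.startswith("xl/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

Calling extract_media('your_excel_file.xlsx') returns the list of saved paths, which can then be passed to the OCR step.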


Step 4: Use Tesseract to Perform OCR on Extracted Images

After extracting the images from the Excel sheet, we can now use Tesseract to perform OCR and extract text from these images.

python

from PIL import Image as PILImage
import pytesseract

# Example: Load the saved image and perform OCR
img_name = 'image_0.png'  # Replace with the actual image file name

# Open the image using Pillow
img = PILImage.open(img_name)

# Perform OCR on the image using Tesseract
extracted_text = pytesseract.image_to_string(img)

# Output the extracted text
print("Extracted Text:")
print(extracted_text)

This code opens an image, runs Tesseract on it to extract the text, and prints the extracted text.
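
Raw OCR output often carries trailing spaces, runs of blank lines, and sometimes a form-feed character at the end. The helper below is a small sketch of how you might tidy the text before using it; clean_ocr_text is my own name, and the sample string is made up for illustration rather than real Tesseract output.

```python
import re


def clean_ocr_text(raw):
    """Normalize OCR output: drop form feeds, trim lines, collapse blank runs."""
    text = raw.replace("\f", "")
    # Squeeze runs of spaces and tabs inside each line, then trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


sample = "Invoice   No. 42 \n\n\n\nTotal:\t 19.99\n\f"
print(clean_ocr_text(sample))
```

In the OCR step you would call clean_ocr_text(extracted_text) right after pytesseract.image_to_string.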


Step 5: Putting It All Together

Here’s a complete Python script that does the following:

  1. Extracts all images embedded in an Excel sheet.
  2. Applies Tesseract OCR to each image to extract text.
  3. Outputs the extracted text for each image.

python

from io import BytesIO
import os

from openpyxl import load_workbook
from PIL import Image as PILImage
import pytesseract

# Ensure Tesseract path is correctly set up (only required for Windows)
# Uncomment and change the path below to your Tesseract installation if needed:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Step 1: Load the Excel workbook
workbook_path = 'your_excel_file.xlsx'
wb = load_workbook(workbook_path)
sheet = wb.active  # You can specify a different sheet if needed

# Step 2: Iterate over the images in the Excel sheet
if not sheet._images:
    print("No images found in the Excel sheet.")
else:
    # Create a directory to store the images
    os.makedirs('extracted_images', exist_ok=True)

    for idx, img in enumerate(sheet._images):
        # Save each image locally as PNG.
        # _images and _data() are private openpyxl members and may change
        # between versions; in openpyxl 3.x the image object has no .image attribute.
        img_name = f"extracted_images/image_{idx}.png"
        PILImage.open(BytesIO(img._data())).save(img_name)
        print(f"Image {idx} saved as {img_name}")

        # Step 3: Open the saved image using Pillow
        with PILImage.open(img_name) as img_file:
            # Step 4: Perform OCR on the image using Tesseract
            extracted_text = pytesseract.image_to_string(img_file)

        # Step 5: Output the extracted text
        print(f"Extracted Text from Image {idx}:")
        print(extracted_text)
        print("-" * 50)  # Separator between images

    print("Image extraction and OCR process completed.")

Explanation of the Code:

  1. Loading the Excel File:
    • The script opens the Excel file using openpyxl and selects the active sheet.
    • You can change the sheet = wb.active line to load a different sheet if needed by specifying the sheet name (e.g., sheet = wb['Sheet2']).
  2. Extracting Images:
    • The script loops over all images embedded in the worksheet (sheet._images) and saves each image as a PNG file in a folder called extracted_images.
    • Each image is saved with a unique name (e.g., image_0.png, image_1.png, etc.).
  3. Applying OCR to Each Image:
    • After saving each image, the script opens the image using the Pillow library and applies Tesseract OCR (pytesseract.image_to_string(img_file)) to extract any text present in the image.
    • The extracted text is printed to the console for review.
  4. Output and Clean-Up:
    • If no images are found in the Excel sheet, the script outputs a message stating that no images were found.
    • A separator line ("-" * 50) is printed after the extracted text of each image to make the output clearer.
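
The script reads only the active sheet and does not record where each image sits. If you need to know which sheet and cell an image belongs to, something along these lines may work. Treat it as a sketch under assumptions: image_locations is a hypothetical helper, it leans on the private _images and anchor._from members of openpyxl (which may change between versions), and it reports None for images that are not anchored to a cell.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook
from openpyxl.drawing.image import Image as XLImage
from openpyxl.utils import get_column_letter
from PIL import Image as PILImage


def image_locations(wb):
    """Return (sheet title, image index, top-left cell or None) for every image."""
    locations = []
    for ws in wb.worksheets:
        for idx, img in enumerate(ws._images):
            # Cell-based anchors carry a marker with 0-based col and row.
            marker = getattr(img.anchor, "_from", None)
            if marker is None:
                cell = None
            else:
                cell = f"{get_column_letter(marker.col + 1)}{marker.row + 1}"
            locations.append((ws.title, idx, cell))
    return locations
```

For example, image_locations(load_workbook('your_excel_file.xlsx')) gives a list you can use to label each block of extracted text with its sheet and cell.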

Step 6: Running the Script

  1. Save the Script: Copy the Python script into a file, e.g., extract_text_from_images.py.
  2. Run the Script: Use a terminal or command prompt to run the script:

     bash

     python extract_text_from_images.py
  3. Check the Output:
    • The extracted images will be saved in the extracted_images folder.
    • The extracted text from each image will be printed to the terminal/console.
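
Printing to the console is fine for a quick look, but you will probably want the text in a file at some point. Here is one possible sketch that writes image names and their text to CSV using the standard csv module. The write_results name and the sample results list are hypothetical; in practice the list would be filled inside the OCR loop.

```python
import csv
import io


def write_results(results, handle):
    """Write (image_name, text) pairs as CSV rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["image", "text"])
    for image_name, text in results:
        writer.writerow([image_name, text.strip()])


# Made-up results, standing in for what the OCR loop would collect.
results = [("image_0.png", "Total: 19.99\n"), ("image_1.png", "Line one\nLine two")]
buffer = io.StringIO()
write_results(results, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open the target with open('ocr_results.csv', 'w', newline='', encoding='utf-8') and pass that handle in.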

Step 7: Fine-tuning and Troubleshooting

7.1 Improve OCR Accuracy

OCR accuracy can vary depending on the quality of the image and the text in it. Here are some tips to improve accuracy:

  • Image Preprocessing:
    • You can preprocess the image to improve OCR accuracy, for example, by converting it to grayscale or increasing the contrast.
    Example of converting an image to grayscale:

    python

    gray_img = img_file.convert('L')  # Convert to grayscale
    extracted_text = pytesseract.image_to_string(gray_img)

  • Language Specification:
    • If the text in the images is not in English, pass the matching language code to Tesseract (the language data for that code must be installed). The call below uses 'eng', the code for English:

    python

    extracted_text = pytesseract.image_to_string(img_file, lang='eng')  # 'eng' is for English
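
The preprocessing ideas above (grayscale, more contrast, higher resolution) can be combined into one helper. The following is a sketch using Pillow only; preprocess_for_ocr and the default scale factor of 2 are my own choices, not settings recommended by Tesseract, so experiment with your own images.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, scale=2):
    """Return a grayscale, contrast-stretched, upscaled copy of img."""
    gray = ImageOps.grayscale(img)
    stretched = ImageOps.autocontrast(gray)
    new_size = (stretched.width * scale, stretched.height * scale)
    return stretched.resize(new_size, Image.LANCZOS)


# A blank stand-in image; in the real script this would be img_file.
demo = Image.new("RGB", (40, 20), "white")
result = preprocess_for_ocr(demo)
print(result.mode, result.size)
```

In the main script you would pass preprocess_for_ocr(img_file) to pytesseract.image_to_string instead of img_file.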

7.2 Debugging Common Errors

  • Tesseract not found: If Tesseract isn’t properly installed or if Python can’t find it, make sure the path to the Tesseract executable is correctly set using:

    python

    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    Ensure Tesseract is installed in the specified path.
  • Low OCR Accuracy:
    • If the OCR results are poor, you may need to clean or preprocess the images, increase resolution, or check that the Tesseract language data for the language of the text is installed.
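
To see which languages your Tesseract installation can actually use, you can ask the command-line tool directly with its --list-langs option. The sketch below wraps that call; parse_langs and installed_languages are hypothetical helper names, and the parsing assumes the output starts with a "List of available languages" header followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip empty lines and the "List of available languages ..." header.
    return [ln for ln in lines if ln and not ln.lower().startswith("list of")]


def installed_languages():
    """Return installed language codes, or [] if Tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    proc = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some builds write the list to stderr instead of stdout.
    return parse_langs(proc.stdout or proc.stderr)


print(installed_languages())
```

If the code you pass as lang is not in this list, install the matching language data before blaming image quality.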

Step 8: Automating the Workflow

To scale the workflow:

  • You can run this script on multiple Excel files by modifying the input path to process each Excel file in a directory.
  • Automate the process to run regularly if you’re frequently receiving new Excel files containing embedded images.
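
For the first point, collecting the workbooks in a folder takes only a few lines. This is a minimal sketch; find_workbooks is a hypothetical name, and the loop body is a placeholder for whatever per-file routine you build from the Step 5 script. It skips the temporary "~$" lock files that Excel creates while a workbook is open.

```python
from pathlib import Path


def find_workbooks(directory):
    """Return .xlsx files in directory, skipping Excel's ~$ lock files."""
    return sorted(
        path
        for path in Path(directory).glob("*.xlsx")
        if not path.name.startswith("~$")
    )


for workbook_path in find_workbooks("."):
    # The per-file extraction and OCR routine would be called here.
    print(f"Would process {workbook_path}")
```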

Conclusion

Using Tesseract OCR with Python, you can automate the extraction of text from images embedded in Excel files. This guide walked through installing the necessary tools, extracting images from Excel sheets, and performing OCR to convert image-based text into a machine-readable format. With this setup, you can process large Excel documents with embedded images and use the extracted text in your production environment.
