Streamline Document Processing with Optical Character Recognition Using lucataco/olmocr-7b Actions

22 Apr 2025

Optical Character Recognition (OCR) technology has transformed how we handle document processing, enabling applications to extract text from images and PDFs efficiently. The lucataco/olmocr-7b API offers a powerful OCR action backed by a fine-tuned vision-language model. In this article, we'll explore how to integrate the OCR capabilities of the olmOCR action into your applications so you can automate the extraction of textual content from document images.

Prerequisites

Before you begin using the olmOCR action, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic knowledge of making HTTP requests and handling JSON data.
  • A suitable development environment set up for making API calls (e.g., Python with the requests library).

Authentication typically involves including your API key in the request headers. This key verifies your identity and grants you access to the Cognitive Actions services.
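As a minimal sketch, the headers might be built as follows. This assumes a Bearer-token scheme, which is common for APIs like this but should be confirmed against the platform's documentation:

```python
# Hypothetical Bearer-token header construction; confirm the exact
# authentication scheme against the Cognitive Actions documentation.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```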

Cognitive Actions Overview

Perform Optical Character Recognition with olmOCR

The Perform Optical Character Recognition with olmOCR action allows you to extract text from document images using the olmOCR model, which is based on the Qwen2-VL-7B-Instruct architecture. Its efficiency and accuracy make it particularly well suited to large-scale document processing.

Input

The input schema for this action requires the following fields:

  • pdf (string, required): The URI of the input PDF file. This is the primary document from which text will be extracted.
  • pageNumber (integer, optional): Specifies the page number of the PDF to process. Defaults to 1.
  • temperature (number, optional): Controls the randomness of text generation. A higher value results in more randomness. Defaults to 0.8.
  • maxNewTokens (integer, optional): Sets the maximum number of new tokens to generate. Defaults to 100.

Example Input:

{
  "pdf": "https://replicate.delivery/pbxt/MZwEFkeOMRsr3TBINVyOTyNMTUeViRyLQsnWndmKJC7TH6gg/horribleocr.pdf",
  "pageNumber": 1,
  "temperature": 0.8,
  "maxNewTokens": 1024
}

Output

The action typically returns a JSON object containing the extracted text and related metadata. For instance:

{
  "primary_language": "en",
  "is_rotation_valid": true,
  "rotation_correction": 0,
  "is_table": false,
  "is_diagram": false,
  "natural_text": "Christians behaving themselves like Mahomedans..."
}

This output includes the detected primary language, whether the page's rotation was judged valid (along with any rotation correction), flags indicating whether the page contains a table or diagram, and the extracted text itself in natural_text.
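To make the structure concrete, here is a short sketch that parses a response body like the one above and guards on the metadata before trusting the extracted text. The raw string simply mirrors the example output; a real application would use the body returned by the API:

```python
import json

# Hypothetical raw response body mirroring the example output above.
raw = """
{
  "primary_language": "en",
  "is_rotation_valid": true,
  "rotation_correction": 0,
  "is_table": false,
  "is_diagram": false,
  "natural_text": "Christians behaving themselves like Mahomedans..."
}
"""

result = json.loads(raw)

# Check the rotation metadata before using the extracted text.
if result["is_rotation_valid"]:
    text = result["natural_text"]
    print(f"[{result['primary_language']}] {text}")
```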

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet to demonstrate how to invoke the Perform Optical Character Recognition with olmOCR action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "77a468f7-3d0e-469b-852c-106f52934055"  # Action ID for Perform Optical Character Recognition with olmOCR

# Construct the input payload based on the action's requirements
payload = {
    "pdf": "https://replicate.delivery/pbxt/MZwEFkeOMRsr3TBINVyOTyNMTUeViRyLQsnWndmKJC7TH6gg/horribleocr.pdf",
    "pageNumber": 1,
    "temperature": 0.8,
    "maxNewTokens": 1024
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # json.JSONDecodeError subclasses ValueError; older requests versions raise plain ValueError
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The payload is structured according to the input schema, specifying the PDF URI and the optional parameters. The response is checked for HTTP errors, and any failure is printed along with the response body for debugging.
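Since the action processes one page per request, a multi-page PDF is handled by varying pageNumber across calls. The sketch below builds one payload per page; the build_page_payload helper is illustrative, and the request structure around it remains hypothetical, as above:

```python
# Sketch: prepare inputs for the first three pages of a PDF by
# varying pageNumber. Each payload would then be POSTed as
# {"action_id": ..., "inputs": payload}, as in the snippet above.
def build_page_payload(pdf_uri: str, page: int) -> dict:
    """Build the inputs dict for a single page of the PDF (hypothetical helper)."""
    return {
        "pdf": pdf_uri,
        "pageNumber": page,
        "temperature": 0.8,
        "maxNewTokens": 1024,
    }

pdf_uri = "https://replicate.delivery/pbxt/MZwEFkeOMRsr3TBINVyOTyNMTUeViRyLQsnWndmKJC7TH6gg/horribleocr.pdf"
payloads = [build_page_payload(pdf_uri, page) for page in range(1, 4)]
```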

Conclusion

The Perform Optical Character Recognition with olmOCR action from the lucataco/olmocr-7b API provides developers with a robust solution for extracting text from documents efficiently. By integrating this action into your applications, you can automate document processing and enhance user experiences. Consider exploring additional use cases, such as automating data entry or improving accessibility in your applications. Happy coding!