Unlocking Text Extraction: Integrate OCR with hexiaochun/pp-ocr-v4 Cognitive Actions

22 Apr 2025

Optical Character Recognition (OCR) enables applications to extract text from images automatically. The hexiaochun/pp-ocr-v4 Cognitive Actions give developers a pre-built action tailored for document OCR, so you can integrate text extraction into your applications and select the recognition language per request.

Prerequisites

Before we dive into the specifics of the Cognitive Actions, make sure you have the following:

  • An API key for the Cognitive Actions platform that you'll use to authenticate your requests.
  • A basic understanding of JSON structure, as the input and output for the actions will be in JSON format.

Authentication is typically handled by passing an API key in the request headers.

Cognitive Actions Overview

Perform Optical Character Recognition (OCR) on Images

This action recognizes text in images using the hexiaochun/pp-ocr-v4 model. It extracts textual content from an image file and accepts an optional language setting.

  • Category: document-ocr

Input

The input for this action requires the following fields:

  • image (required): A string representing the URI of the input image file. This must be a valid URL.
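  • language (optional): A string naming the language model to use. The default value is "ch" (Chinese). Note that "ch" is not the ISO 639-1 code for Chinese (that is "zh"), so the accepted values appear to be model-specific language codes rather than ISO 639-1 codes.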

Example Input:

{
  "image": "https://replicate.delivery/pbxt/LU2RNiJHWHYLMvsZZjGktBFFVUB3OYR49mzp20Mln3WNPznP/output.jpg",
  "language": "ch"
}
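The input rules above can be checked on the client before any request is sent. The following is a minimal sketch, not part of the Cognitive Actions platform: the helper name build_ocr_input is hypothetical, and the assumption that only http(s) URLs are accepted is ours.

```python
from urllib.parse import urlparse


def build_ocr_input(image_url: str, language: str = "ch") -> dict:
    """Return an input payload for the OCR action after basic checks.

    Hypothetical helper: the platform may apply stricter or looser rules.
    """
    parsed = urlparse(image_url)
    # Assumption: the image must be reachable over http(s) and name a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"image must be a valid http(s) URL, got {image_url!r}")
    if not language:
        raise ValueError("language must be a non-empty string")
    return {"image": image_url, "language": language}


# Omitting the language falls back to the documented default, "ch".
payload = build_ocr_input("https://example.com/receipt.jpg")
print(payload)
```

Validating locally gives a clearer error message than a rejected request, though the service remains the final authority on what it accepts.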

Output

The output is typically a JSON object with a results array. Each result includes:

  • box: An array of [x, y] coordinate pairs marking the corners of the bounding box around the detected text.
  • text: The extracted text from the image.
  • confidence: A float representing the confidence level of the extracted text.

Example Output:

{
  "results": [
    {
      "box": [
        [161, 669],
        [418, 669],
        [418, 732],
        [161, 732]
      ],
      "text": "柔软舒适的",
      "confidence": 0.9974066019058228
    }
  ]
}
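To show how this output might be consumed, here is a small sketch that filters results by confidence and reduces each box to a position and size. The function name summarize_results and the 0.9 threshold are illustrative assumptions, and the code assumes the response has the shape shown in the example output.

```python
def summarize_results(output: dict, min_confidence: float = 0.9) -> list:
    """Keep confident results and convert each box to left/top/width/height."""
    lines = []
    for item in output.get("results", []):
        if item["confidence"] < min_confidence:
            continue  # Skip text the model is unsure about
        xs = [point[0] for point in item["box"]]
        ys = [point[1] for point in item["box"]]
        lines.append({
            "text": item["text"],
            "left": min(xs),
            "top": min(ys),
            "width": max(xs) - min(xs),
            "height": max(ys) - min(ys),
        })
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    lines.sort(key=lambda line: (line["top"], line["left"]))
    return lines


sample_output = {
    "results": [
        {
            "box": [[161, 669], [418, 669], [418, 732], [161, 732]],
            "text": "柔软舒适的",
            "confidence": 0.9974066019058228,
        }
    ]
}

for line in summarize_results(sample_output):
    print(line)
```

Taking the minimum and maximum of the corner coordinates also works when a box is slightly rotated, at the cost of a looser rectangle.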

Conceptual Usage Example (Python)

Here’s how you might call the OCR action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "8f6cde58-bdf3-4148-9637-6a3b3ed616c5"  # Action ID for Perform Optical Character Recognition (OCR) on Images

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/LU2RNiJHWHYLMvsZZjGktBFFVUB3OYR49mzp20Mln3WNPznP/output.jpg",
    "language": "ch"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30,  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body was not valid JSON (covers all JSON decode errors)
            print(f"Response body: {e.response.text}")

In this example, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key and confirm the endpoint URL, which is hypothetical here. The request body carries the action ID together with the input payload; the exact structure may differ on the real platform.

Conclusion

By leveraging the hexiaochun/pp-ocr-v4 Cognitive Actions, you can add OCR to your applications, from extracting text from images to selecting the recognition language, without building the pipeline yourself. Consider exploring other use cases, such as automating document digitization or building content analysis tools.