Enhance Document Processing with IBM Granite Vision 3.2 Cognitive Actions

In the realm of intelligent document processing, IBM Granite Vision 3.2 offers a powerful set of Cognitive Actions designed to extract and interpret content from visual documents. These actions simplify the integration of advanced document understanding capabilities into applications, allowing developers to leverage state-of-the-art models without deep expertise in machine learning.
Prerequisites
Before diving into the integration of Cognitive Actions, ensure you have the following:
- An API key for accessing the Cognitive Actions platform.
- Basic knowledge of making HTTP requests and handling JSON data.
Authentication typically involves including your API key in the request headers, allowing secure access to the service.
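As a minimal sketch of this pattern (the Bearer scheme and header names below follow common REST conventions; the exact scheme for your platform may differ, so consult its documentation):

```python
# Hypothetical helper: attach an API key to request headers using
# Bearer-token authentication, a common convention for REST APIs.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder; never hard-code real keys

def build_headers(api_key: str) -> dict:
    """Return request headers carrying the API key as a Bearer token."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_headers(API_KEY)
```

In practice, load the key from an environment variable or a secrets manager rather than embedding it in source code.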
Cognitive Actions Overview
Analyze Visual Documents
The Analyze Visual Documents action utilizes the Granite-Vision-3.2-2B model to extract and understand content from visual documents, such as tables, charts, and diagrams. This action falls under the document-processing category.
Input
The input to this action is a JSON object with the following fields:
- image (string): A URI pointing to the image input.
- prompt (string): The text input to describe or augment the image.
- maxTokens (integer): The maximum number of tokens the model should generate as output.
- temperature (number): Controls the randomness of the model's output (default is 0.6).
- systemPrompt (string): Sets context for the model's responses.
- topProbability (number): The nucleus-sampling (top-p) probability threshold for token selection (default is 0.9).
Example Input
{
  "image": "https://upload.wikimedia.org/wikipedia/commons/e/e1/FullMoon2010.jpg",
  "prompt": "Describe this image",
  "maxTokens": 512,
  "temperature": 0.6,
  "systemPrompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.",
  "topProbability": 0.9
}
Output
The output from this action is typically an array of token strings; concatenated in order, they form the generated text describing the visual content. For example:
[
"",
"The",
" image",
" de",
"pict",
"s",
" the",
" Mo",
"on",
",",
" captured",
...
]
This output provides a detailed description of the visual content present in the provided image.
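Because the tokens arrive as fragments, a small post-processing step (shown here as a sketch; the token list is abbreviated from the example above) joins them into readable text:

```python
# Token fragments as returned by the action (abbreviated example).
tokens = ["", "The", " image", " de", "pict", "s", " the",
          " Mo", "on", ",", " captured"]

# Concatenate the fragments in order and trim surrounding whitespace
# to recover the generated description.
description = "".join(tokens).strip()
# → "The image depicts the Moon, captured"
```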
Conceptual Usage Example (Python)
Here is a conceptual Python code snippet demonstrating how a developer might call this action:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "ed2f0374-4c72-4cb7-8405-84b9e545e0bd"  # Action ID for Analyze Visual Documents

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://upload.wikimedia.org/wikipedia/commons/e/e1/FullMoon2010.jpg",
    "prompt": "Describe this image",
    "maxTokens": 512,
    "temperature": 0.6,
    "systemPrompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.",
    "topProbability": 0.9
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The payload follows the input schema described above, and the action ID identifies the Analyze Visual Documents action. Note that the endpoint URL and request structure shown here are illustrative; consult your platform's API reference for the exact values.
Conclusion
The Cognitive Actions provided by IBM's Granite Vision 3.2 empower developers to seamlessly integrate advanced document processing capabilities into their applications. By utilizing the Analyze Visual Documents action, you can extract meaningful insights from visual content, thereby enhancing user experiences and automating workflows. Consider exploring other potential use cases and applications of this powerful tool in your projects!