Generate Image Descriptions and Visual Answers with UForm-Gen2 Cognitive Actions

23 Apr 2025

The UForm-Gen2-Qwen-500m API exposes Cognitive Actions that give developers access to a compact multimodal model. This post covers one of them: generating image descriptions and answering questions about an image. It walks through the action's inputs and outputs and ends with a practical usage example.

Prerequisites

Before integrating the UForm-Gen2 Cognitive Actions, make sure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic knowledge of how to make HTTP requests.
  • Familiarity with JSON payload structures.

Requests are typically authenticated by passing the API key in the request headers.

Cognitive Actions Overview

Generate Image Descriptions and Answer Visual Questions

This action uses UForm-Gen2 to generate captions for images and answer questions about them. The underlying model is a compact multimodal model, which keeps image processing efficient.

Input

The input for this action follows the CompositeRequest schema, which has one required field and two optional ones:

  • image (required): The URI of the input image.
  • prompt (optional): Instructions for describing or analyzing the input image. The default value is "Describe the image in three sentences."
  • maxNewTokens (optional): The maximum number of tokens to generate in response to the prompt, defaulting to 256.

Example Input:

{
    "image": "https://replicate.delivery/pbxt/KPrmoP0t3TNpwsHNV5TmwJjcK1xQb0Vhw2AAtu9P7x7Sca4F/cat.jpg",
    "prompt": "Describe the image in three sentences.",
    "maxNewTokens": 256
}
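
The field rules above can be sketched as a small Python helper that fills in the documented defaults and rejects obviously invalid values. The name build_payload and the specific checks (a non-empty image string, a positive integer token limit) are assumptions for illustration; the platform may validate inputs differently.

```python
# Illustrative helper; `build_payload` is a hypothetical name, not part of the API.
DEFAULT_PROMPT = "Describe the image in three sentences."
DEFAULT_MAX_NEW_TOKENS = 256


def build_payload(image, prompt=DEFAULT_PROMPT, max_new_tokens=DEFAULT_MAX_NEW_TOKENS):
    """Return an input dict using the field names from the schema above."""
    # `image` is the only required field.
    if not isinstance(image, str) or not image.strip():
        raise ValueError("image must be a non-empty URI string")
    # Assumed sanity check: the token limit should be a positive integer.
    if (
        isinstance(max_new_tokens, bool)
        or not isinstance(max_new_tokens, int)
        or max_new_tokens < 1
    ):
        raise ValueError("max_new_tokens must be a positive integer")
    return {"image": image, "prompt": prompt, "maxNewTokens": max_new_tokens}


if __name__ == "__main__":
    print(build_payload("https://example.com/cat.jpg"))
```

Calling the helper with only an image URI yields the same prompt and token limit as the example input, since both defaults come from the field descriptions.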

Output

The output of this action is generated text that responds to the prompt: a description of the image by default, or an answer when the prompt asks a question. For the example input above, you might see something like this:

Example Output:

A white and orange cat stands on its hind legs, reaching for a white teapot on a wooden table in a garden. The teapot is on a white tablecloth, and a basket of red raspberries is nearby. The cat's position and actions create a playful and charming scene.
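
This post does not document the exact shape of the response, so client code may want to normalize whatever comes back into a single string. The sketch below assumes the output could arrive either as one string or as a sequence of text fragments; both the extract_text name and that assumption are hypothetical and should be checked against a real response.

```python
# Illustrative only: the actual response shape is not documented in this post.
def extract_text(output):
    """Normalize a hypothetical action output into a single string."""
    if isinstance(output, str):
        return output.strip()
    if isinstance(output, (list, tuple)):
        # Assumption: chunked output is a sequence of text fragments.
        return "".join(str(chunk) for chunk in output).strip()
    raise TypeError(f"unsupported output type: {type(output).__name__}")


if __name__ == "__main__":
    print(extract_text(["A white and orange cat ", "stands on its hind legs."]))
```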

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet to illustrate how a developer might call a hypothetical Cognitive Actions execution endpoint for this action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "03abd500-c19b-4790-b936-24ae90a88707" # Action ID for Generate Image Descriptions and Answer Visual Questions

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/KPrmoP0t3TNpwsHNV5TmwJjcK1xQb0Vhw2AAtu9P7x7Sca4F/cat.jpg",
    "prompt": "Describe the image in three sentences.",
    "maxNewTokens": 256
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=30 # Fail instead of hanging indefinitely if the service does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Covers JSON decode errors across requests versions
            print(f"Response body: {e.response.text}")

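In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action_id identifies the Generate Image Descriptions and Answer Visual Questions action, and the payload follows the input schema described above. The endpoint URL and the request body structure are hypothetical, so adjust them to match the platform's actual API.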
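
The same request structure could carry a visual question instead of a captioning prompt. The sketch below only builds the request body and sends nothing over the network. The build_question_request helper is hypothetical, the action_id/inputs wrapper mirrors the hypothetical structure used above, and the lower default token limit for short answers is an assumption.

```python
import json

ACTION_ID = "03abd500-c19b-4790-b936-24ae90a88707"  # Action ID from the example above


def build_question_request(image_url, question, max_new_tokens=64):
    """Build a request body for a visual question (hypothetical structure)."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return {
        "action_id": ACTION_ID,
        "inputs": {
            "image": image_url,
            "prompt": question,
            "maxNewTokens": max_new_tokens,  # Assumed: short answers need fewer tokens
        },
    }


if __name__ == "__main__":
    body = build_question_request(
        "https://replicate.delivery/pbxt/KPrmoP0t3TNpwsHNV5TmwJjcK1xQb0Vhw2AAtu9P7x7Sca4F/cat.jpg",
        "What is the cat reaching for?",
    )
    print(json.dumps(body, indent=2))
```

The returned dict can be passed as the json argument of requests.post in place of the inline dictionary.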

Conclusion

The UForm-Gen2 Cognitive Actions give developers a straightforward way to add image captioning and visual question answering to their applications. A single action that takes an image URI and an optional prompt can describe an image or answer questions about its content. Try integrating the action into your own application, and consider extending its use to other multimodal tasks.