Generate Image Descriptions with Cognitive Actions for Hayooucom Vision Model

24 Apr 2025

In the world of AI-driven image analysis, the Hayooucom Vision Model provides a powerful toolset to help developers implement sophisticated features in their applications. One of its standout capabilities is generating detailed, contextually relevant descriptions of images. This is achieved through Cognitive Actions: pre-built actions designed to enhance image understanding with customizable parameters. By harnessing them, developers can significantly reduce the time and effort required to build advanced image processing functionality.

Prerequisites

Before diving into the integration of Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform, which will be used for authentication.
  • Familiarity with making HTTP requests and handling JSON data.

Conceptually, authentication typically involves passing the API key in the request headers, allowing your application to securely access the Cognitive Actions services.
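As a minimal sketch of that convention, the snippet below builds the request headers with the API key as a bearer token. The exact header scheme is an assumption here; confirm it against your platform's documentation.

```python
# Hypothetical placeholder; substitute your real Cognitive Actions API key.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

# A common convention is to pass the key as a bearer token in the
# Authorization header alongside a JSON content type.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```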

Cognitive Actions Overview

Generate Image Descriptions with Phi-3 Vision

The Generate Image Descriptions with Phi-3 Vision action utilizes the Phi-3 Vision model to provide detailed descriptions based on images. It can analyze images provided through URLs or Base64 encoded data, making it flexible for various use cases. This action is especially beneficial for applications needing image captioning, accessibility features, or content generation.

Input

The input for this action requires a JSON object that can include the following fields:

  • seed (integer): The seed for the random number generator.
  • topK (integer, default: 1): Samples from the top k most likely tokens during text decoding.
  • topP (number, default: 1): Nucleus sampling; samples from the smallest set of tokens whose cumulative probability reaches p.
  • prompt (string, default: "hello, who are you?"): Text prompt to guide the model's response.
  • imageUrl (array of strings): Public image URLs to analyze.
  • maxTokens (integer, default: 45000): The maximum number of tokens to process.
  • imageBase64 (array of strings): Base64 encoded images to analyze if no URL is provided.
  • temperature (number, default: 0.1): Controls the randomness of outputs.
  • maxNewTokens (integer, default: 200): The maximum number of new tokens to generate.
  • systemPrompt (string, default: "You are a helpful AI assistant."): Instructions for the AI behavior.
  • repetitionPenalty (number, default: 1.1): Penalty applied to repetitive words in generated text.
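Since most of these fields have documented defaults, a convenient pattern is to keep the defaults in one place and merge per-request overrides onto them. The `build_payload` helper below is hypothetical, not part of any platform SDK; the defaults mirror the list above.

```python
# Documented defaults for the action's optional fields (from the list above).
DEFAULTS = {
    "topK": 1,
    "topP": 1,
    "prompt": "hello, who are you?",
    "maxTokens": 45000,
    "temperature": 0.1,
    "maxNewTokens": 200,
    "systemPrompt": "You are a helpful AI assistant.",
    "repetitionPenalty": 1.1,
}

def build_payload(**overrides):
    """Merge caller overrides onto the documented defaults.

    Hypothetical helper for illustration only.
    """
    return {**DEFAULTS, **overrides}

# Example: override only the prompt and image URL (placeholder URL).
payload = build_payload(
    prompt="please describe this image.",
    imageUrl=["https://example.com/chart.png"],
)
```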

Example Input:

{
  "topK": 1,
  "topP": 1,
  "prompt": "please describe this image.",
  "imageUrl": [
    "https://support.content.office.net/en-us/media/3dd2b79b-9160-403d-9967-af893d17b580.png"
  ],
  "maxTokens": 45000,
  "imageBase64": [],
  "temperature": 0.1,
  "maxNewTokens": 458,
  "systemPrompt": "You are a helpful AI assistant.",
  "repetitionPenalty": 1.1
}

Output

The output of this action is a list of string fragments that together form a coherent description of the input image. Each element is a small piece of the generated text (often a single word or token), so you will typically concatenate the array to reconstruct the full description.

Example Output:

[
  "",
  "The ",
  "chart ",
  "is ",
  "a ",
  "spreadsheet ",
  "table ",
  "with ",
  "data ",
  "organized ",
  "into ",
  "columns ",
  "and ",
  "rows. ",
  "It ",
  "includes ",
  "headers ",
  "labeled ",
  "'Product', ",
  "'Qtr ",
  "1', ",
  "'Qtr ",
  "2', ",
  "and ",
  "'Grand ",
  "Total'. ",
  ...
]
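Because the result arrives as word-level fragments, reassembling the description is a simple join. A minimal sketch, using a shortened version of the example output above:

```python
# Fragments as returned by the action (shortened from the example output).
tokens = [
    "",
    "The ",
    "chart ",
    "is ",
    "a ",
    "spreadsheet ",
    "table. ",
]

# Each fragment already carries its trailing space, so a plain join
# reconstructs the text; strip() removes leading/trailing whitespace.
description = "".join(tokens).strip()
print(description)  # → "The chart is a spreadsheet table."
```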

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet demonstrating how to call the Cognitive Actions execution endpoint to generate image descriptions:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "97d54c7f-b616-4a71-a8e2-030b70b35cf0"  # Action ID for Generate Image Descriptions with Phi-3 Vision

# Construct the input payload based on the action's requirements
payload = {
    "topK": 1,
    "topP": 1,
    "prompt": "please describe this image.",
    "imageUrl": [
        "https://support.content.office.net/en-us/media/3dd2b79b-9160-403d-9967-af893d17b580.png"
    ],
    "maxTokens": 45000,
    "imageBase64": [],
    "temperature": 0.1,
    "maxNewTokens": 458,
    "systemPrompt": "You are a helpful AI assistant.",
    "repetitionPenalty": 1.1
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # avoid hanging indefinitely on a slow or unreachable endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers JSON decode errors across requests versions
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
  • The action_id is specified for the Generate Image Descriptions with Phi-3 Vision action.
  • The input payload is structured according to the action's requirements.
  • The endpoint URL and request structure are illustrative and may vary based on your actual setup.
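If the image is not publicly reachable by URL, the same payload can carry Base64-encoded image data via the imageBase64 field instead. The sketch below uses a small placeholder byte string; in a real application you would read the bytes from a local file (e.g. `open("chart.png", "rb").read()`).

```python
import base64

# Placeholder bytes standing in for real image data; in practice read
# the file contents from disk.
image_bytes = b"\x89PNG\r\n\x1a\n"

# Base64-encode the bytes and decode to an ASCII string for JSON transport.
encoded = base64.b64encode(image_bytes).decode("ascii")

payload = {
    "prompt": "please describe this image.",
    "imageUrl": [],            # leave URLs empty and supply Base64 data instead
    "imageBase64": [encoded],
    "maxNewTokens": 458,
}
```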

Conclusion

The Hayooucom Vision Model's Cognitive Actions empower developers to integrate advanced image description capabilities into their applications with ease. By leveraging the Generate Image Descriptions with Phi-3 Vision action, you can enhance user experiences through detailed image analysis and contextual insights. Explore further use cases like automated content generation, accessibility improvements, or data visualization enhancements to fully harness the potential of this powerful toolset.