Enhance Your Applications with Multi-Modal Capabilities Using UniVAL Cognitive Actions

21 Apr 2025

Integrating advanced multi-modal capabilities into your applications is straightforward with the cjwbw/unival spec, which offers a set of pre-built Cognitive Actions built around the UniVAL model. Its primary action handles unified tasks, including image, video, and audio captioning, as well as visual grounding, so you can enhance user experiences without the heavy lifting of building complex models from scratch.

Prerequisites

Before diving into the implementation of Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Familiarity with making HTTP requests, particularly POST requests.
  • Basic understanding of JSON for structuring input and output data.

Authentication typically involves passing your API key in the request headers to authorize access to the Cognitive Actions services.
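As a minimal sketch of that authentication step: the `Authorization: Bearer` scheme and exact header names below are assumptions based on common API conventions, so check your platform's documentation before relying on them.

```python
# Minimal sketch of request headers for the Cognitive Actions API.
# The "Authorization: Bearer <key>" scheme is an assumption based on
# common API conventions, not confirmed by the spec.

def build_headers(api_key: str) -> dict:
    """Return the HTTP headers used to authenticate a Cognitive Actions request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_headers("YOUR_COGNITIVE_ACTIONS_API_KEY")
```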

Cognitive Actions Overview

Execute Multi-Modal Task

The Execute Multi-Modal Task action allows for a unified approach to various tasks such as image captioning, video captioning, audio captioning, and visual grounding. This action is particularly useful for developers looking to incorporate multi-modal functionalities into their applications.

Input

The input for this action is structured as follows:

  • taskType: Specifies the type of task to perform. Options include:
    • Image Captioning
    • Video Captioning
    • Audio Captioning
    • Visual Grounding
    • General
    • General Video
    • Default is Image Captioning.
  • inputAudio: A URI pointing to the input audio file (used when task is Audio Captioning).
  • inputImage: A URI pointing to the input image file (used when task is Image Captioning or Visual Grounding).
  • inputVideo: A URI pointing to the input video file (used when task is Video Captioning or General Video).
  • instruction: Instructions for the task. Defaults to "What does the image/video/audio describe?" for Captioning tasks.

Example Input:

{
  "taskType": "Visual Grounding",
  "inputImage": "https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
  "instruction": "detached banana"
}
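Since each task type pairs with a different media field, a small helper can guard against sending the wrong one. This is a hypothetical convenience wrapper, not part of the spec; the field names and default instruction follow the schema above, and the General task type is omitted because the schema does not state which input field it uses.

```python
# Hypothetical helper that builds a payload for the Execute Multi-Modal Task
# action, pairing each taskType with its required media field per the schema
# above. Not part of the spec itself -- just a convenience sketch.
from typing import Optional

# Which media field each task type expects. "General" is omitted because
# the schema does not specify its input field.
TASK_MEDIA_FIELD = {
    "Image Captioning": "inputImage",
    "Visual Grounding": "inputImage",
    "Video Captioning": "inputVideo",
    "General Video": "inputVideo",
    "Audio Captioning": "inputAudio",
}

# Documented default instruction for captioning tasks.
DEFAULT_INSTRUCTION = "What does the image/video/audio describe?"

def build_payload(task_type: str, media_uri: str,
                  instruction: Optional[str] = None) -> dict:
    """Return a payload dict with the media URI under the field the task expects."""
    field = TASK_MEDIA_FIELD.get(task_type)
    if field is None:
        raise ValueError(f"Unsupported task type: {task_type!r}")
    return {
        "taskType": task_type,
        field: media_uri,
        "instruction": instruction if instruction is not None else DEFAULT_INSTRUCTION,
    }

payload = build_payload(
    "Visual Grounding",
    "https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
    instruction="detached banana",
)
```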

Output

The output from this action typically includes:

  • answer: The textual answer to the task. This may be null when the result is delivered as a file instead, as in the Visual Grounding example below.
  • output: A URI pointing to the generated output, which can be an image, caption, or other relevant data based on the task.

Example Output:

{
  "answer": null,
  "output": "https://assets.cognitiveactions.com/invocations/f0d7421f-20f9-4207-aaad-db181b4435ae/6282eb65-69b4-42f7-8856-f366853bbbbb.png"
}
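Once a response comes back, downstream code usually only needs the output URI, plus the answer when one is present. Here is a small hypothetical accessor based on the output schema above:

```python
# Hypothetical accessor for an Execute Multi-Modal Task result, based on the
# output schema above: "answer" may be null, "output" is a URI to the
# generated artifact.

def extract_result(result: dict) -> tuple:
    """Return (answer, output_uri) from an action result dict."""
    answer = result.get("answer")      # may legitimately be None
    output_uri = result.get("output")  # URI of the generated image/caption/etc.
    return answer, output_uri

example = {
    "answer": None,
    "output": "https://assets.cognitiveactions.com/invocations/f0d7421f-20f9-4207-aaad-db181b4435ae/6282eb65-69b4-42f7-8856-f366853bbbbb.png",
}
answer, output_uri = extract_result(example)
```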

Conceptual Usage Example (Python)

Here’s how you might call the Execute Multi-Modal Task action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "f038fbd1-693f-415a-88c4-89ee883f27e8"  # Action ID for Execute Multi-Modal Task

# Construct the input payload based on the action's requirements
payload = {
    "taskType": "Visual Grounding",
    "inputImage": "https://replicate.delivery/pbxt/JKlg8Ru5XgAnr03vzxqrP5V9ztpyfYuaNpUm0hVXs0XEKmu1/1.png",
    "instruction": "detached banana"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this example, replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key. The payload is structured according to the action's input schema, and the request is sent to a hypothetical endpoint for executing the action.

Conclusion

The cjwbw/unival Cognitive Actions offer a robust way to integrate multi-modal capabilities into your applications. With the Execute Multi-Modal Task action, developers can handle image, video, and audio captioning as well as visual grounding through a single interface, enriching user interactions without managing a separate model for each modality.