Generate Multimodal Predictions with jyoung105/imp Cognitive Actions

21 Apr 2025
Integrating advanced machine learning capabilities into your applications has never been easier. The jyoung105/imp Cognitive Actions provide a set of pre-built actions designed to harness the power of multimodal small language models. Among these actions, the primary focus is on generating predictions based on input images and contextual prompts. This article will guide you through one of the key actions—Generate Multimodal Predictions—and show you how to integrate it into your applications effectively.

Prerequisites

Before you start using the Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic understanding of how to make API calls and handle JSON payloads.
  • A development environment with the necessary libraries for making HTTP requests installed (such as requests in Python).

Authentication typically involves passing your API key in the request headers to authorize your calls.
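As a minimal sketch of this pattern, the headers can be built once and reused across calls. The Bearer scheme matches the full example later in this article; the build_headers helper name is illustrative, not part of the platform's API:

```python
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, replace with your key

def build_headers(api_key: str) -> dict:
    """Standard headers for an authenticated Cognitive Actions call."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_headers(API_KEY)
```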

Cognitive Actions Overview

Generate Multimodal Predictions

The Generate Multimodal Predictions action is designed to produce predictions based on a combination of an input image and a guiding prompt. This action is part of the text-generation category, making it particularly useful for applications that require contextual understanding of visual content.

Input

The action requires the following fields in the input schema:

  • image (required): A URL pointing to the input image.
  • prompt (required): A string prompt that instructs the model on what to generate.
  • topP (optional): A number that defines the top-p sampling parameter, with a default value of 0.95.
  • temperature (optional): A number that affects the randomness of token selection, with a default of 0.7.
  • maxNewTokens (optional): An integer specifying the maximum number of new tokens to generate, defaulting to 100.

Example Input:

{
  "topP": 0.95,
  "image": "https://replicate.delivery/pbxt/KJMIMNBJwHYQX1A4tfSmacSSxccDH3sVQtgzpwMuv88CbuJz/demo-1.jpg",
  "prompt": "What is the title of this book?",
  "temperature": 0.7,
  "maxNewTokens": 100
}
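The schema above can be captured in a small helper that validates the required fields and fills in the documented defaults (topP 0.95, temperature 0.7, maxNewTokens 100). This build_payload function is a hypothetical convenience, not part of the platform's SDK:

```python
# Defaults taken from the action's input schema described above.
DEFAULTS = {"topP": 0.95, "temperature": 0.7, "maxNewTokens": 100}

def build_payload(image: str, prompt: str, **overrides) -> dict:
    """Combine the required fields with schema defaults; explicit overrides win."""
    if not image.startswith(("http://", "https://")):
        raise ValueError("image must be a valid URL")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown optional fields: {unknown}")
    return {"image": image, "prompt": prompt, **DEFAULTS, **overrides}

payload = build_payload(
    "https://replicate.delivery/pbxt/KJMIMNBJwHYQX1A4tfSmacSSxccDH3sVQtgzpwMuv88CbuJz/demo-1.jpg",
    "What is the title of this book?",
    temperature=0.2,  # lower temperature for a more deterministic answer
)
```

Passing an optional field overrides its default, while anything not supplied keeps the documented value.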

Output

The action typically returns a string output representing the generated prediction. For instance, if the prompt asks about a book, the output could be the title of the book shown in the input image.

Example Output:

"The Little Book of Deep Learning"

Conceptual Usage Example (Python)

Here’s how you can call the Generate Multimodal Predictions action using Python. The snippet below shows how to structure the input payload and send the request to the Cognitive Actions endpoint:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "f869d6eb-65c4-4ee4-944d-012df34ac460"  # Action ID for Generate Multimodal Predictions

# Construct the input payload based on the action's requirements
payload = {
    "topP": 0.95,
    "image": "https://replicate.delivery/pbxt/KJMIMNBJwHYQX1A4tfSmacSSxccDH3sVQtgzpwMuv88CbuJz/demo-1.jpg",
    "prompt": "What is the title of this book?",
    "temperature": 0.7,
    "maxNewTokens": 100
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The payload variable is structured according to the input schema, ensuring that all required fields are included. The response is then printed in a formatted manner for easy readability.

Conclusion

The Generate Multimodal Predictions action from the jyoung105/imp Cognitive Actions enables developers to create applications that understand and generate contextual information from visual inputs. Whether you're building a book recommendation system or an interactive storytelling app, this action provides a powerful tool to enhance user experience. Start integrating these Cognitive Actions today to bring your applications to life!