Unlocking Multimodal Insights with the Obsidian-3B V0.5 Cognitive Actions

23 Apr 2025

The tomasmcm/obsidian-3b-v0.5 API provides developers with Cognitive Actions built for multimodal analysis. By processing textual prompts and visual elements together, these actions can greatly enhance applications that require a nuanced understanding of images alongside text. Because the actions come pre-built, integration is straightforward, letting developers focus on building solutions instead of training complex models.

Prerequisites

Before you start using the Cognitive Actions, make sure you have the following:

  • An API key for authenticating against the Cognitive Actions platform.
  • Basic knowledge of JSON and API requests.
  • A suitable development environment set up for making HTTP requests.

Authentication typically involves passing your API key in the headers of your requests to secure access to the Cognitive Actions.
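As a minimal sketch of that authentication step, the helper below builds the request headers. The "Authorization: Bearer" scheme is an assumption for illustration; check your platform's documentation for the exact header it expects.

```python
# Hedged sketch: construct headers that authenticate a Cognitive Actions
# request. The Bearer scheme is assumed, not confirmed by the API docs.

def build_auth_headers(api_key: str) -> dict:
    """Return HTTP headers carrying the API key and a JSON content type."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_auth_headers("YOUR_COGNITIVE_ACTIONS_API_KEY")
```

Every request shown later in this post passes a headers dictionary with exactly this shape.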

Cognitive Actions Overview

Analyze Multimodal Content

The Analyze Multimodal Content action uses the Obsidian-3B-V0.5 model, billed as the world's smallest multimodal language model, to interpret input that combines language and visual elements. It is well suited to applications that need to analyze images in conjunction with textual queries.

Input:

  • imageFile (required): A URI of the image to be analyzed. For example:
    "imageFile": "https://replicate.delivery/pbxt/JtmRbGjiYipZA4IRMGHTveQWqbgGscNGDSYQhsS35iW4KmRU/extreme_ironing.jpg"
    
  • prompt (required): A text prompt to guide the analysis. For example:
    "prompt": "What is unusual about this image?"
    
  • temperature (optional): A float controlling the randomness of the output; lower values produce more deterministic text. Default is 0.2.
  • maxNewTokens (optional): Maximum number of new tokens the model can generate. Default is 512.
  • debug (optional): Boolean to activate detailed logging. Default is false.
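To make the parameter rules above concrete, here is a small sketch of a payload builder that enforces the two required fields and applies the documented defaults when the optional ones are omitted. The field names follow the parameter list; the helper itself (`build_payload`) is hypothetical convenience code, not part of the API.

```python
# Hedged sketch: assemble the action's input payload, applying the
# documented defaults (temperature=0.2, maxNewTokens=512, debug=False).

def build_payload(image_file: str, prompt: str,
                  temperature: float = 0.2,
                  max_new_tokens: int = 512,
                  debug: bool = False) -> dict:
    """Build an input payload; imageFile and prompt are required."""
    if not image_file or not prompt:
        raise ValueError("imageFile and prompt are both required")
    return {
        "imageFile": image_file,
        "prompt": prompt,
        "temperature": temperature,
        "maxNewTokens": max_new_tokens,
        "debug": debug,
    }

payload = build_payload(
    "https://replicate.delivery/pbxt/JtmRbGjiYipZA4IRMGHTveQWqbgGscNGDSYQhsS35iW4KmRU/extreme_ironing.jpg",
    "What is unusual about this image?",
)
```

Calling the builder with only the two required arguments yields exactly the example payload shown below.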

Example Input:

{
  "debug": false,
  "prompt": "What is unusual about this image?",
  "imageFile": "https://replicate.delivery/pbxt/JtmRbGjiYipZA4IRMGHTveQWqbgGscNGDSYQhsS35iW4KmRU/extreme_ironing.jpg",
  "temperature": 0.2,
  "maxNewTokens": 512
}

Output: The action will return a text analysis of the image based on the prompt provided. For instance:

The unusual aspect of this image is that a man is standing on top of a yellow vehicle, which appears to be a taxi. This is not a common sight, as people usually do not stand on top of a moving vehicle...

Conceptual Usage Example (Python): Here’s how you can invoke the Analyze Multimodal Content action in Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "ef7dcfa9-4761-4abc-b734-e135b2d2f6c1"  # Action ID for Analyze Multimodal Content

# Construct the input payload based on the action's requirements
payload = {
    "debug": False,
    "prompt": "What is unusual about this image?",
    "imageFile": "https://replicate.delivery/pbxt/JtmRbGjiYipZA4IRMGHTveQWqbgGscNGDSYQhsS35iW4KmRU/extreme_ironing.jpg",
    "temperature": 0.2,
    "maxNewTokens": 512
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code snippet:

  • The action_id is set to the ID of the Analyze Multimodal Content action.
  • The payload is constructed using the example input data.
  • The request is sent to the hypothetical Cognitive Actions endpoint, which is structured to include the action ID and the input payload.
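Once a request succeeds, you will usually want to pull the model's analysis text out of the JSON response. The sketch below assumes a hypothetical response shape with an "outputs" object containing a "text" field; inspect an actual response from your endpoint to confirm the real keys before relying on this.

```python
# Hedged sketch: extract the analysis text from a successful response.
# The "outputs" -> "text" structure is an assumption for illustration.

def extract_analysis(result: dict) -> str:
    """Return the analysis text, or an empty string if it is absent."""
    outputs = result.get("outputs", {})
    return outputs.get("text", "")

sample_result = {"outputs": {"text": "The unusual aspect of this image is..."}}
analysis = extract_analysis(sample_result)
```

Defaulting to an empty string keeps downstream code simple: callers can check `if analysis:` instead of handling missing keys themselves.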

Conclusion

The Cognitive Actions provided by the Obsidian-3B-V0.5 model enable developers to easily integrate multimodal analysis into their applications. By leveraging these pre-built actions, developers can create intelligent applications that understand and interpret content in a richer, more nuanced way. Consider exploring further use cases, such as integrating these capabilities into chatbots, content moderation tools, or creative applications that require visual analysis alongside text.