Unlocking Visual Insights: Integrate Multimodal Question Answering with Bunny Phi-2 SigLIP Actions

21 Apr 2025

In today's digital landscape, the ability to analyze images and derive meaningful insights through natural language prompts has become crucial. The adirik/bunny-phi-2-siglip API offers a powerful Cognitive Action built on Bunny, an efficient multimodal model from BAAI-DCAI that combines the SigLIP vision encoder with the Phi-2 language model. This action lets developers perform visual question answering and image captioning with a single request, enabling applications ranging from chatbots to educational tools.

Prerequisites

Before you start integrating the Bunny Cognitive Actions into your application, ensure you have the following:

  • API Key: To authenticate your requests, you'll need an API key from the Cognitive Actions platform.
  • Setup: Familiarity with making HTTP requests in your programming environment (e.g., Python, JavaScript).

Authentication typically involves passing your API key in the headers of your requests.
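For example, in Python the request headers might be built as follows. Note that the bearer-token scheme shown here is an assumption based on common API conventions; confirm the exact header format in your platform's documentation.

```python
# Hypothetical key; replace with your actual Cognitive Actions API key.
# The "Bearer" scheme is an assumption, not confirmed platform documentation.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```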

Cognitive Actions Overview

Perform Multimodal Visual Question Answering and Captioning

This action uses the Bunny model to provide intelligent responses based on visual input and textual prompts. It can effectively analyze images and generate captions or answers to questions posed about those images.

Input

The input for this action is structured as follows:

  • image (required): A URI pointing to the image to be analyzed.
  • prompt (required): A text prompt or question related to the image.
  • temperature (optional): A number controlling the randomness of sampling; higher values produce more varied output. Defaults to 0.2.
  • maxNewTokens (optional): The maximum number of new tokens to generate in the response. Defaults to 100.
  • topProbability (optional): The nucleus (top-p) sampling threshold, a value between 0 and 1. Defaults to 0.7.

Example Input:

{
  "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
  "prompt": "What is the astronaut holding in his hand?",
  "temperature": 0.2,
  "maxNewTokens": 100,
  "topProbability": 0.7
}
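Catching schema problems before sending a request saves a round trip. The sketch below checks the payload against the schema described above; the range checks on the optional parameters are reasonable assumptions, not documented platform limits.

```python
def validate_bunny_input(payload):
    """Return a list of problems; an empty list means the payload looks valid.

    Field names and defaults mirror the action's input schema; the range
    checks are assumptions, not documented limits.
    """
    problems = []
    for field in ("image", "prompt"):
        if not payload.get(field):
            problems.append(f"missing required field: {field}")
    top_p = payload.get("topProbability", 0.7)
    if not 0 <= top_p <= 1:
        problems.append("topProbability must be between 0 and 1")
    max_new = payload.get("maxNewTokens", 100)
    if not isinstance(max_new, int) or max_new <= 0:
        problems.append("maxNewTokens must be a positive integer")
    return problems

payload = {
    "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
    "prompt": "What is the astronaut holding in his hand?",
}
print(validate_bunny_input(payload))  # → []
```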

Output

When executed, this action returns a textual response to the prompt based on the analyzed image.

Example Output: "The astronaut is holding a green beer bottle in his hand."
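Because the action's result is a plain string, a thin helper can normalize whatever envelope your platform wraps it in. The `output` key used below is hypothetical; adjust it to match the actual response schema you receive.

```python
def extract_answer(result):
    # The envelope key "output" is an assumption about the platform's
    # response schema; plain-string results are passed through unchanged.
    if isinstance(result, dict):
        return str(result.get("output", ""))
    return str(result)

print(extract_answer({"output": "The astronaut is holding a green beer bottle in his hand."}))
# → The astronaut is holding a green beer bottle in his hand.
```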

Conceptual Usage Example (Python)

Below is a conceptual Python snippet demonstrating how to invoke this action using a hypothetical Cognitive Actions API endpoint:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "d9f0aad7-a595-45fd-8a7b-73e6e4bef591"  # Action ID for Perform Multimodal Visual Question Answering and Captioning

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/KTFhbC8R9gUwjn6ETus3lIYjaTLR5zTxmxwWD5iPwRuakm75/example_1.png",
    "prompt": "What is the astronaut holding in his hand?",
    "temperature": 0.2,
    "maxNewTokens": 100,
    "topProbability": 0.7
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical request structure
        timeout=60,  # fail fast instead of hanging on an unresponsive endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code, replace the COGNITIVE_ACTIONS_API_KEY and the endpoint URL with your actual API key and endpoint. The action ID is specified for the "Perform Multimodal Visual Question Answering and Captioning" action, and the input payload matches the expected schema.

Conclusion

The Bunny model's Cognitive Action for multimodal visual question answering and captioning offers developers a robust tool for deriving insights from images using natural language queries. By integrating this capability, you can enhance the interactivity and intelligence of your applications. Explore various use cases, from customer support to educational platforms, and leverage this technology to create engaging user experiences. Happy coding!