Unlocking Multimodal Capabilities with lucataco/phi-4-multimodal-instruct Actions

22 Apr 2025

Multimodal AI models have revolutionized how we interact with technology, allowing us to integrate text, images, and audio seamlessly. The lucataco/phi-4-multimodal-instruct API provides a powerful set of Cognitive Actions that enable developers to harness these advanced capabilities for various applications. By using these pre-built actions, you can improve efficiency and accuracy across a range of scenarios, whether for transcribing audio, summarizing text, or describing images.

Prerequisites

Before you start integrating the Cognitive Actions from the lucataco/phi-4-multimodal-instruct API, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • A basic understanding of JSON for structuring your requests.
  • Python 3 with the requests library installed, if you plan to follow the usage example below.

Authentication typically involves passing your API key in the request headers to authorize your actions.
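As a minimal sketch, the headers might look like the following in Python. Note that the Bearer scheme shown here is an assumption based on common API conventions; check the platform documentation for the exact header format your account expects.

```python
# Placeholder key -- replace with your real Cognitive Actions API key.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

# Hypothetical header layout: Bearer-token authorization plus a JSON
# content type, mirroring the request structure used later in this post.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```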

Cognitive Actions Overview

Process Multimodal Requests

The Process Multimodal Requests action allows you to execute and manage complex multimodal tasks using the Phi-4 multimodal-instruct model. It combines advanced language, vision, and speech capabilities to enhance efficiency and context awareness across multiple languages.

Input

The input for this action consists of several fields that you can customize based on your requirements:

  • task: An array of tasks to be performed. Available tasks include 'transcribe', 'summarize', and 'describe'.
  • text: The input text to process, which can include questions or prompts for further actions.
  • audio: An array of audio file URIs. Each URI must point to a valid audio file.
  • images: An array of image URIs. Each URI must point to a valid image file.
  • maxTokens: The maximum number of tokens to generate (default is 1000).
  • temperature: A sampling temperature to control randomness (default is 0.7).
  • systemPrompt: A prompt for the system's behavior or role (default is "You are a helpful assistant.").

Example Input

Here is a practical example of the JSON payload needed to invoke this action:

{
  "text": "What is shown in this image?",
  "images": [
    "https://replicate.delivery/pbxt/Ma2T6ufUKuLsiP3VzC9r1owPT0ObFQ4LTgG6f7LJ6Shg0Aez/australia.jpg"
  ],
  "maxTokens": 1000,
  "temperature": 0.7,
  "systemPrompt": "You are a helpful assistant."
}
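
The same action can work from audio rather than images. The sketch below combines the 'transcribe' and 'summarize' tasks from the field list above; the audio URI is a placeholder, not a real file:

```json
{
  "task": ["transcribe", "summarize"],
  "audio": [
    "https://example.com/recording.wav"
  ],
  "maxTokens": 1000,
  "temperature": 0.7,
  "systemPrompt": "You are a helpful assistant."
}
```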

Output

The action typically returns a descriptive output based on the tasks performed. For example, if you input an image of a stop sign, the output might be:

A stop sign in front of a building with Chinese writing on it.
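
If you are consuming the result programmatically, you will want to pull the generated text out of the response JSON. The exact response shape is an assumption here; this sketch assumes the generated text arrives under an "output" key, which you should verify against the platform's actual responses.

```python
# Hypothetical response shape -- the "output" key is an assumption,
# not documented behavior; inspect a real response to confirm.
result = {
    "output": "A stop sign in front of a building with Chinese writing on it."
}

# Fall back to an empty string if the key is absent.
description = result.get("output", "")
print(description)
```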

Conceptual Usage Example (Python)

Below is a conceptual example of how you might call the Process Multimodal Requests action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "68c2a5c6-f4a7-4158-9ac6-a00a3bae2d8f"  # Action ID for Process Multimodal Requests

# Construct the input payload based on the action's requirements
payload = {
    "text": "What is shown in this image?",
    "images": [
        "https://replicate.delivery/pbxt/Ma2T6ufUKuLsiP3VzC9r1owPT0ObFQ4LTgG6f7LJ6Shg0Aez/australia.jpg"
    ],
    "maxTokens": 1000,
    "temperature": 0.7,
    "systemPrompt": "You are a helpful assistant."
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60  # Avoid hanging indefinitely on a slow or unreachable endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this example, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The input payload is structured according to the action's requirements, and the result is printed upon successful execution.

Conclusion

The lucataco/phi-4-multimodal-instruct Cognitive Actions provide a robust way to integrate multimodal capabilities into your applications. By leveraging the Process Multimodal Requests action, you can enhance your application's ability to process and understand complex data types, ultimately leading to more interactive and intelligent user experiences. Explore the possibilities and consider how these actions can fit into your next project!