Enhance Your Applications with Multimodal Responses Using lucataco/qwen2.5-omni-7b Cognitive Actions

In today's rapidly evolving digital landscape, the ability to engage users through multiple modalities—text, images, audio, and video—is essential. The lucataco/qwen2.5-omni-7b Cognitive Actions enable developers to leverage the Qwen2.5-Omni model to create rich, interactive experiences that seamlessly integrate these modalities. This model excels in generating natural text and speech responses, enhancing user interactions while providing a versatile tool for various applications.
Prerequisites
Before diving into the Cognitive Actions, ensure you have the following:
- An API key for the Cognitive Actions platform to authenticate your requests.
- Basic understanding of making HTTP requests in your preferred programming language.
Authentication can typically be handled by passing your API key in the headers of your requests.
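For example, assuming a bearer-token scheme (the exact header name may differ for your deployment), the request headers might look like this in Python:

```python
# Hypothetical bearer-token headers for the Cognitive Actions API.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, not a real key

headers = {
    "Authorization": f"Bearer {API_KEY}",  # API key passed in the Authorization header
    "Content-Type": "application/json",    # request bodies are JSON
}
```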
Cognitive Actions Overview
Generate Multimodal Responses with Qwen2.5-Omni
The Generate Multimodal Responses with Qwen2.5-Omni action enables you to utilize the Qwen2.5-Omni model to receive and process various input formats, including text prompts, audio files, images, and videos. This action allows for real-time, robust interactions that can cater to diverse user needs.
- Category: Multimodal Interaction
Input
The input for this action is structured as a JSON object with the following fields:
{
  "audio": "optional_uri_string",
  "image": "optional_uri_string",
  "video": "optional_uri_string",
  "prompt": "string",
  "voiceType": "Chelsie | Ethan",
  "systemPrompt": "string",
  "generateAudio": true | false,
  "useAudioInVideo": true | false
}
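Since prompt is the only required field, a minimal valid input can be very small. A sketch (the prompt text itself is illustrative):

```python
# Minimal input: only the required "prompt" field is supplied; all
# optional fields fall back to their documented defaults
# (voiceType "Chelsie", generateAudio true, useAudioInVideo true).
minimal_input = {
    "prompt": "Describe the weather in a cheerful voice.",
}
```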
Example Input:
{
  "video": "https://replicate.delivery/pbxt/MmJqxKbRSknHd9fwtTEbywWqDDdhgsx5tNYLIDnFqJ9j5ObC/draw.mp4",
  "voiceType": "Chelsie",
  "systemPrompt": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
  "generateAudio": true,
  "useAudioInVideo": true
}
- Required Fields:
prompt: A textual prompt to guide the model's response.
- Optional Fields:
audio: URI for an audio input file.
image: URI for an image input file.
video: URI for a video input file.
voiceType: Voice used for audio output (defaults to "Chelsie").
systemPrompt: Initializes the model's context (defaults to a pre-defined statement).
generateAudio: Whether audio output should be generated (defaults to true).
useAudioInVideo: Whether the audio track of the video input should be used (defaults to true).
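To make the interplay between the required prompt and these defaults concrete, here is an illustrative helper (the helper itself is not part of the API; the default values are the ones listed above):

```python
# Documented defaults for the optional generation fields.
DEFAULTS = {
    "voiceType": "Chelsie",
    "generateAudio": True,
    "useAudioInVideo": True,
}

def build_inputs(prompt, **optional):
    """Merge the required prompt with optional fields, applying defaults."""
    inputs = {"prompt": prompt, **DEFAULTS}
    inputs.update(optional)  # caller-supplied values override the defaults
    return inputs
```

For instance, `build_inputs("Hello", voiceType="Ethan")` keeps `generateAudio` at its default while switching the voice.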
Output
Upon successful execution, the action typically returns a JSON object with the following structure:
{
  "text": "string",
  "voice": "uri_string"
}
Example Output:
{
  "text": "Oh, that's a really cool drawing! It looks like a guitar. You've got the body and the neck drawn in a simple yet effective way. The lines are clean and the shape is well-defined. What made you choose to draw a guitar?",
  "voice": "https://assets.cognitiveactions.com/invocations/265ea461-43c4-4974-a29c-bbc3b4d9de23/ab2fb733-8958-493a-8034-6135237c6090.wav"
}
- Returned Fields:
text: The generated textual response based on the input.
voice: A URI pointing to the generated audio file that speaks the response.
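A small sketch of consuming this output: read the text field directly, then derive a local filename from the voice URI before downloading the audio (the sample result below mirrors the example output; the actual download call is left commented out):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath
# import urllib.request  # uncomment to actually download the audio

def voice_filename(result):
    """Derive a local filename from the 'voice' URI in the action output."""
    return PurePosixPath(urlparse(result["voice"]).path).name

result = {
    "text": "Oh, that's a really cool drawing!",
    "voice": "https://assets.cognitiveactions.com/invocations/265ea461-43c4-4974-a29c-bbc3b4d9de23/ab2fb733-8958-493a-8034-6135237c6090.wav",
}

print(result["text"])                 # show the textual response
filename = voice_filename(result)     # local name for the spoken response
# urllib.request.urlretrieve(result["voice"], filename)  # save the .wav file
```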
Conceptual Usage Example (Python)
Here’s how you can invoke the action using Python:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "85bb5e55-d8ce-4359-8f58-ae3d321a0f14"  # Action ID for Generate Multimodal Responses with Qwen2.5-Omni

# Construct the input payload based on the action's requirements
payload = {
    "prompt": "Describe what is being drawn in this video.",  # prompt is the action's required field
    "video": "https://replicate.delivery/pbxt/MmJqxKbRSknHd9fwtTEbywWqDDdhgsx5tNYLIDnFqJ9j5ObC/draw.mp4",
    "voiceType": "Chelsie",
    "systemPrompt": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    "generateAudio": True,
    "useAudioInVideo": True,
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this example, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The code constructs the input payload, including a video URI and required parameters, and sends a POST request to the hypothetical Cognitive Actions execution endpoint.
Conclusion
The lucataco/qwen2.5-omni-7b Cognitive Actions empower developers to create engaging and responsive applications that utilize multimodal inputs and outputs. With the ability to generate natural text and speech responses based on diverse media, you can enhance user experiences significantly. Consider integrating these actions into your applications to explore the vast potential of multi-modal interactions!