Enhance Your Applications with Qwen-VL-Chat Cognitive Actions for Multimodal Interactions

Integrating advanced functionalities into applications can significantly enhance user experience. The nomagick/qwen-vl-chat API offers developers a powerful Cognitive Action that leverages the Qwen-VL-Chat model. This action facilitates streaming multimodal interactions, enabling applications to analyze and respond to both text and images dynamically. In this article, we'll explore the capabilities of the Execute Qwen-VL-Chat with Streaming action, its input and output requirements, and how to effectively implement it in your applications.
Prerequisites
Before diving into the implementation, ensure that you have the following ready:
- An API key for the Cognitive Actions platform to authenticate your requests.
- Basic knowledge of how to structure JSON payloads for API calls.
Authentication is typically handled by passing the API key in the request headers, ensuring that your application can securely access the Cognitive Actions platform.
Cognitive Actions Overview
Execute Qwen-VL-Chat with Streaming
The Execute Qwen-VL-Chat with Streaming action enables developers to utilize the Qwen-VL-Chat model through a ChatML prompt interface. This action is particularly valuable for applications that require detailed analysis and responses based on image and text inputs.
Input
The input for this action is structured as follows:
- topP: Controls randomness; a higher value allows sampling from more of the distribution. Must be between 0 and 1. Default is 0.8.
- prompt: The initial text provided for completion, formatted in ChatML. Default is a conversation starter example.
- temperature: Adjusts the randomness of responses. Higher values lead to more random outputs. Acceptable range is 0 to 5, with a default of 0.75.
- filesArchive: URI to an archive containing supplementary files referenced in the prompt, if additional images are required.
- primaryImage: The primary image URI, which can be optionally included.
- maximumTokens: The maximum number of new tokens to generate. Must be between 1 and 8192. The default is 2048.
- secondaryImage: The URI for the optional second image.
- tertiaryImage: The URI for the optional third image.
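Because the action enforces ranges on `topP`, `temperature`, and `maximumTokens`, it can be useful to validate parameters client-side before sending a request. The helper below is a sketch: the function name and structure are illustrative, while the ranges and defaults come from the parameter descriptions above.

```python
def build_payload(prompt, top_p=0.8, temperature=0.75, maximum_tokens=2048,
                  primary_image=None):
    """Validate parameters against the documented ranges and assemble a payload."""
    if not 0 <= top_p <= 1:
        raise ValueError("topP must be between 0 and 1")
    if not 0 <= temperature <= 5:
        raise ValueError("temperature must be between 0 and 5")
    if not 1 <= maximum_tokens <= 8192:
        raise ValueError("maximumTokens must be between 1 and 8192")
    payload = {
        "topP": top_p,
        "prompt": prompt,
        "temperature": temperature,
        "maximumTokens": maximum_tokens,
    }
    if primary_image is not None:
        payload["primaryImage"] = primary_image
    return payload
```

Failing fast on out-of-range values keeps malformed requests from ever reaching the API, which makes errors easier to diagnose than a server-side rejection.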
Example Input:
{
  "topP": 0.8,
  "prompt": "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nGiven this image: <img>image1</img>, point out where the dog is<|im_end|>\n<|im_start|>assistant\n",
  "temperature": 0.75,
  "primaryImage": "https://replicate.delivery/pbxt/JfWlCzhD5GBoiDRSmNhMpyYCrmd78lKkLU2JFgr1imbJZIIN/demo.jpeg",
  "maximumTokens": 2048
}
Output
The output from this action is streamed as an array of tokens. When concatenated, the tokens form a response that may include `<ref>` tags around identified objects and `<box>` tags containing bounding-box coordinates indicating where those objects appear in the provided images.
Example Output:
[
  "<ref>", " the", " dog", "</ref>",
  "<box>", "(", 2, 1, 1, ",", 4, 2, 7, "),(", 5, 6, 9, ",", 8, 9, 2, ")",
  "</box>"
]
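Since the output arrives as a list of token fragments, a client usually concatenates them and then parses the `<ref>`/`<box>` markup. The sketch below assumes the stream joins into the tag format shown above; the function name, regex, and coordinate interpretation are illustrative rather than part of the documented API.

```python
import re

def parse_boxes(tokens):
    """Join streamed tokens and extract (label, boxes) pairs from <ref>/<box> markup."""
    text = "".join(str(t) for t in tokens)
    results = []
    # Pair each <ref>label</ref> with the <box>...</box> block that follows it.
    for label, box_text in re.findall(r"<ref>(.*?)</ref><box>(.*?)</box>", text):
        coords = [
            (int(x), int(y))
            for x, y in re.findall(r"\((\d+),(\d+)\)", box_text)
        ]
        results.append((label.strip(), coords))
    return results
```

Applied to the example output above, this would yield the label `"the dog"` together with the two corner points of its bounding box.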
Conceptual Usage Example (Python)
Here’s a conceptual example of how to call the Execute Qwen-VL-Chat with Streaming action using a hypothetical API endpoint in Python:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "88e55ed8-af50-4e9e-add0-02b1430bd469"  # Action ID for Execute Qwen-VL-Chat with Streaming

# Construct the input payload based on the action's requirements
payload = {
    "topP": 0.8,
    "prompt": "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nGiven this image: <img>image1</img>, point out where the dog is<|im_end|>\n<|im_start|>assistant\n",
    "temperature": 0.75,
    "primaryImage": "https://replicate.delivery/pbxt/JfWlCzhD5GBoiDRSmNhMpyYCrmd78lKkLU2JFgr1imbJZIIN/demo.jpeg",
    "maximumTokens": 2048
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code snippet, the action_id corresponds to the Execute Qwen-VL-Chat with Streaming action. The input payload is constructed based on the action’s requirements, allowing developers to send text and image inputs for analysis and response.
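Because the action streams its output, a client may also want to consume tokens incrementally rather than waiting for the complete response. The sketch below is hypothetical: it assumes the endpoint streams newline-delimited JSON tokens, and the actual wire format depends on the Cognitive Actions platform.

```python
import json
import requests

def decode_stream(lines):
    """Decode tokens from an iterable of newline-delimited JSON text lines."""
    for line in lines:
        if line:  # skip keep-alive blank lines
            yield json.loads(line)

def stream_tokens(url, headers, body):
    """Yield tokens from a hypothetical streaming endpoint as they arrive."""
    with requests.post(url, headers=headers, json=body, stream=True) as resp:
        resp.raise_for_status()
        yield from decode_stream(resp.iter_lines(decode_unicode=True))
```

Processing tokens as they arrive lets an application render partial responses immediately, which is the main benefit of a streaming action over a blocking request.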
Conclusion
The Execute Qwen-VL-Chat with Streaming action provides a robust solution for integrating multimodal interactions into applications. By leveraging this action, developers can enhance user engagement through detailed analyses of images and text inputs. As you explore the possibilities of the Qwen-VL-Chat model, consider how it can be applied to various use cases, such as interactive applications, educational tools, and more. Happy coding!