# Unlock Multimodal AI Capabilities with Magma 8b

The Magma 8b service brings a powerful suite of Cognitive Actions that enable developers to harness multimodal artificial intelligence for innovative applications. By leveraging the capabilities of the Microsoft Magma-8B model, these actions facilitate the generation of text and actions based on a combination of images and textual prompts. This results in a more interactive and intelligent system that can understand and respond to visual and textual information simultaneously.
Imagine being able to automate tasks that require comprehension of both visual and textual data, such as generating descriptions for images, answering questions about visual content, or even planning actions based on observed scenarios. The Magma 8b actions not only simplify these processes but also enhance the speed and accuracy of AI interactions across various fields such as education, content creation, and customer service.
## Prerequisites
To get started with the Magma 8b Cognitive Actions, you will need an API key to access the service. A basic understanding of making API calls will also be beneficial.
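One common way to keep the API key out of source code is to read it from an environment variable. The sketch below assumes a variable named `COGNITIVE_ACTIONS_API_KEY`; the name is illustrative, not mandated by the service.

```python
import os

# Read the API key from an environment variable instead of
# hard-coding it; the variable name here is illustrative.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY")
if api_key is None:
    print("Warning: COGNITIVE_ACTIONS_API_KEY is not set")
```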
## Generate Multimodal AI Text and Actions
The Generate Multimodal AI Text and Actions action utilizes the Magma-8B model to produce textual outputs and actions based on input images and text prompts. This action excels in tasks involving image and video conditioned text generation, visual planning, and reasoning about spatial relationships.
### Input Requirements
To use this action, you will need to provide:
- Image: A URL of the input image that the model will process (e.g., https://example.com/image.jpg).
- Prompt: A textual prompt that guides the model's response (e.g., "What is in this image?").
- Temperature: A number between 0 and 2 that controls the randomness of the model's outputs; higher values produce more varied responses.
- Use Sampling: A boolean value to determine if sampling should be used for generating responses.
- Num Beams: An integer that specifies how many beams to use during beam search for response generation (between 1 and 5).
- System Context: A string that defines the behavior context for the model's interactions.
- Max New Tokens: An integer setting the upper limit on the number of new tokens to generate (between 1 and 1024).
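The ranges above can be checked on the client side before a request is sent. The helper below is a minimal sketch of such a check: the field names (`image`, `prompt`, `temperature`, `numberOfBeams`, `maximumNewTokens`) follow the example request later on this page, and the validation itself is illustrative, not part of the service.

```python
def validate_inputs(payload: dict) -> list[str]:
    """Return a list of problems with a payload for the Generate
    Multimodal AI Text and Actions action. The ranges come from the
    parameter descriptions above; field names follow the example
    request on this page."""
    problems = []
    if not payload.get("image", "").startswith(("http://", "https://")):
        problems.append("image must be a URL")
    if not payload.get("prompt"):
        problems.append("prompt is required")
    if not 0 <= payload.get("temperature", 0) <= 2:
        problems.append("temperature must be between 0 and 2")
    if not 1 <= payload.get("numberOfBeams", 1) <= 5:
        problems.append("numberOfBeams must be between 1 and 5")
    if not 1 <= payload.get("maximumNewTokens", 1) <= 1024:
        problems.append("maximumNewTokens must be between 1 and 1024")
    return problems

print(validate_inputs({
    "image": "https://example.com/image.jpg",
    "prompt": "What is in this image?",
    "temperature": 0,
    "useSampling": False,
    "numberOfBeams": 1,
    "maximumNewTokens": 128,
}))  # prints []
```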
### Expected Output
The output will be a textual response generated by the model based on the provided image and prompt. For instance, if the input image is a grid of animals, the output might specify which animal is in a particular position of the grid.
Example Output: "The animal in the first row, third column of the grid is a parrot."
### Use Cases for this Specific Action
- Educational Tools: Create interactive learning experiences where students can ask questions about images in real-time.
- Content Generation: Automate the generation of captions or descriptions for visual content in media and marketing.
- Visual Assistants: Develop intelligent agents that can interpret and act based on visual inputs, enhancing user interaction.
- Research Applications: Explore complex scenarios in multimodal AI research to improve understanding of spatial reasoning and context-based actions.
The following example shows how to call this action with Python's `requests` library (the endpoint URL is hypothetical):

```python
import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint.
# Ensure your environment handles the API key securely.
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

# Action ID for: Generate Multimodal AI Text and Actions
action_id = "685ec16e-0a9f-4bc2-aa95-fd4d71df24f2"

# Construct the exact input payload based on the action's requirements.
# This example uses the predefined example_input for this action:
payload = {
    "image": "https://replicate.delivery/pbxt/McWFm9sGqiPzDUxsLW5T9NuReWvYe8Z343emGIbiFIgEdfyr/replicate-prediction-3sp1e3c2v1rme0cneb5b8k5h6c.jpg",
    "prompt": "The figure represents a 3x3 grid containing various animals where each one by one square is considered a block and each block contains an animal from bird, tiger, parrot, mouse. What is the animal of the block located at the first row third column of the grid?",
    "temperature": 0,
    "useSampling": False,
    "numberOfBeams": 1,
    "systemContext": "You are agent that can see, talk and act.",
    "maximumNewTokens": 128
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
print("------------------------------------------------")
```
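Once the call succeeds, you will typically want to pull the generated text out of the JSON result. The helper below is a minimal sketch; the response schema it assumes (an `output` field holding the text, either directly or under a `text` key) is an assumption, so adjust the key names to match what the endpoint actually returns.

```python
def extract_text(result: dict) -> str:
    # Assumed response shape: the generated text lives under an
    # "output" key, either directly as a string or nested under
    # "text". Adjust these names to the real response schema.
    output = result.get("output", "")
    if isinstance(output, str):
        return output
    if isinstance(output, dict):
        return output.get("text", "")
    return ""

# Hypothetical result mirroring the example output shown earlier
sample = {"output": "The animal in the first row, third column of the grid is a parrot."}
print(extract_text(sample))
```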
## Conclusion
The Magma 8b Cognitive Actions, particularly the ability to generate multimodal AI text and actions, open numerous possibilities for developers looking to integrate advanced AI capabilities into their applications. The combination of image processing and text generation not only streamlines workflows but also enhances user engagement and interaction.
As you explore these actions, consider how they can be applied in your projects to create more dynamic and intelligent systems. The next step could involve experimenting with different input parameters or combining this action with other services to build comprehensive AI-driven solutions.