Generate Engaging Text with Gemma 3's Multimodal Capabilities

Gemma 3 12B IT is an advanced Cognitive Actions service designed to leverage the power of AI to generate text from both text and image inputs. Built on a state-of-the-art multimodal model developed by Google, it offers remarkable performance with a generous 128K-token context window, enabling it to produce high-quality, relevant responses. The service supports over 140 languages, making it an invaluable asset for global applications.
The key benefits of integrating Gemma 3 include speed, efficiency, and versatility. Whether you need to generate descriptive text for images, answer questions, or summarize content, Gemma 3 simplifies the process, allowing developers to focus on building innovative solutions. Common use cases include content creation for marketing, enhancing accessibility for visually impaired users through image descriptions, and enriching user interactions in chatbots.
To get started, you'll need a Cognitive Actions API key and a basic understanding of making API calls.
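Hard-coding credentials is easy to get wrong, so one common pattern is to read the API key from an environment variable at startup. A minimal sketch (the variable name `COGNITIVE_ACTIONS_API_KEY` is an assumption; use whatever name your deployment defines):

```python
import os

def get_api_key(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> str:
    """Read the Cognitive Actions API key from the environment.

    Fails loudly instead of silently sending an empty key.
    The environment variable name is an assumption for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Missing API key: set the {env_var} environment variable."
        )
    return key
```

Failing fast here gives a clear error message rather than a confusing 401 from the API later.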
Generate Text with Gemma 3
The "Generate Text with Gemma 3" action allows developers to create text based on prompts, which can include images as additional context. This action is particularly useful for generating detailed descriptions, enhancing user engagement, and providing informative content in various formats, such as blogs, social media posts, and educational materials.
Input Requirements
The input for this action must be structured as a JSON object. Only the prompt field is required:
- prompt: The primary text prompt for the model to process (e.g., "Describe this image in detail.").
- image: An optional URI for an image input, used for multimodal tasks such as image captioning.
- Optional sampling parameters (topK, topP, temperature, maxNewTokens, and systemPrompt) help fine-tune the output.
Here's an example of the input structure:
```json
{
  "topK": 50,
  "topP": 0.9,
  "image": "https://replicate.delivery/pbxt/MeBv1PWmcTf7voSh3U4fxefjKrtfNXaqmfX3UY4Iq6ZYDSlh/bee.jpg",
  "prompt": "Describe this image in detail.",
  "temperature": 0.7,
  "maxNewTokens": 512,
  "systemPrompt": "You are a helpful assistant."
}
```
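A structure like the one above can also be assembled programmatically. The helper below is a sketch: the field names come from the example, but the validation ranges (e.g., topP in (0, 1]) are typical sampling constraints, not documented limits of this action:

```python
def build_gemma_payload(prompt, image=None, top_k=50, top_p=0.9,
                        temperature=0.7, max_new_tokens=512,
                        system_prompt=None):
    """Assemble the JSON payload for the Generate Text action.

    Validation ranges are typical sampling constraints (assumptions),
    not documented limits of the API.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required and must be non-empty")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("topP must be in (0, 1]")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    payload = {
        "prompt": prompt,
        "topK": top_k,
        "topP": top_p,
        "temperature": temperature,
        "maxNewTokens": max_new_tokens,
    }
    # Optional fields are only included when actually provided.
    if image is not None:
        payload["image"] = image
    if system_prompt is not None:
        payload["systemPrompt"] = system_prompt
    return payload
```

Validating locally surfaces mistakes before a request is spent on a doomed API call.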
Expected Output
The expected output is a detailed text description of the input image or an elaboration on the provided prompt. For instance, if an image of a flower garden is submitted, the model might return a comprehensive description covering various aspects like colors, elements, and composition.
Example output:
Here's a detailed description of the image:
**Overall Impression:**
The image is a close-up shot of a vibrant garden scene, focusing on pink cosmos flowers and a busy bumblebee...
**Main Elements:**
- **Cosmos Flowers**: The dominant feature is a cluster of pink cosmos flowers...
- **Bumblebee**: A bumblebee is prominently positioned on one of the cosmos flowers...
**Color Palette:**
- **Dominant Colors**: Pink, green, and yellow...
Use Cases for this Action
- Content Creation: Automatically generate descriptive content for blogs, articles, or social media posts to enhance engagement.
- Accessibility: Provide text descriptions for images to improve accessibility for users with visual impairments.
- Chatbots and Virtual Assistants: Integrate into conversational agents to offer users detailed information based on image inputs or prompts.
- Education: Aid in creating educational materials by generating relevant text based on images or specific topics.
Example: Calling the Action with Python

```python
import json

import requests

# Replace with your actual Cognitive Actions API key and endpoint.
# Ensure your environment handles the API key securely.
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

# This endpoint URL is hypothetical and should be documented for users.
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

# Action ID for: Generate Text with Gemma 3
action_id = "9671dec0-d056-48f8-8633-bb78c211ab7b"

# Construct the exact input payload based on the action's requirements.
# This example uses the predefined example input for this action:
payload = {
    "topK": 50,
    "topP": 0.9,
    "image": "https://replicate.delivery/pbxt/MeBv1PWmcTf7voSh3U4fxefjKrtfNXaqmfX3UY4Iq6ZYDSlh/bee.jpg",
    "prompt": "Describe this image in detail.",
    "temperature": 0.7,
    "maxNewTokens": 512,
    "systemPrompt": "You are a helpful assistant."
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other headers the Cognitive Actions API requires.
}

# Prepare the request body for the hypothetical execution endpoint.
request_body = {
    "action_id": action_id,
    "inputs": payload,
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body,
        timeout=60,
    )
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:
            print(f"Response body (non-JSON): {e.response.text}")
print("------------------------------------------------")
```
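Once the call succeeds, the generated text still has to be pulled out of the JSON result. The exact response schema is not documented above, so the candidate field names below ("output", "text", "generated_text") are assumptions; adjust them to the schema your endpoint actually returns:

```python
def extract_generated_text(result):
    """Best-effort extraction of the model's text from an action result.

    The candidate field names are assumptions for this sketch;
    check your API's real response schema.
    """
    for key in ("output", "text", "generated_text"):
        value = result.get(key)
        if isinstance(value, str) and value:
            return value
        # Some APIs return streamed output as a list of string chunks.
        if isinstance(value, list) and value:
            return "".join(str(part) for part in value)
    raise KeyError("no recognizable text field in result")
```

A helper like this keeps schema-specific details in one place, so a change in the response format only requires one edit.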
In conclusion, Gemma 3's ability to generate text through multimodal inputs opens up a world of possibilities for developers. By streamlining the process of content creation and enhancing user experiences, it empowers you to build more interactive and informative applications. Explore the diverse use cases and consider how you can leverage this powerful AI tool in your next project.