Enhance Your Application with Image Analysis: Integrating the InternLM-XComposer-2 Cognitive Actions

23 Apr 2025

Many applications need to analyze and describe images accurately. InternLM-XComposer-2 offers a set of Cognitive Actions that add image analysis capabilities to your application. Whether you're building a photo management app, a social media platform, or an e-commerce site, these pre-built actions can generate detailed descriptions of images that you can surface as captions, search metadata, or alternative text.

Prerequisites

Before you can start using the InternLM-XComposer-2 Cognitive Actions, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • Familiarity with making HTTP requests and handling JSON data.

Authentication typically involves passing your API key in the request headers, allowing you to securely interact with the Cognitive Actions service.
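
As a minimal sketch of that header-based authentication, the helper below builds the request headers from an API key. The Bearer scheme matches the conceptual example later in this article; the function name build_auth_headers and the environment variable COGNITIVE_ACTIONS_API_KEY are assumptions made for illustration, not documented parts of the platform.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token.

    Falls back to the COGNITIVE_ACTIONS_API_KEY environment variable
    (a hypothetical name chosen for this sketch) when no key is passed.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError(
            "No API key provided and COGNITIVE_ACTIONS_API_KEY is not set"
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; check your platform's documentation for the exact header format it expects.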

Cognitive Actions Overview

Generate Detailed Image Description

The Generate Detailed Image Description action uses the InternLM-XComposer-2.5 model to produce rich, contextually aware descriptions of images. Guided by the text prompt, it can cover elements such as people's moods, clothing, environment, lighting, and colors. The example prompt below asks for concise, comma-separated statements, although the model may still answer in full sentences, as the sample output shows.

  • Category: Image Analysis

Input

The input for this action requires a text prompt and a URI pointing to the image that needs to be described. Below is the input schema and an example:

  • Required Fields:
    • text: A detailed prompt for describing the image.
    • image: A public URI of the image.
{
  "text": "Caption this image. describe every single thing in the image in detail. Do not include any unnecessary words in your description for the sake of good grammar. I want many short statements that serve the single purpose of giving the most thorough description if items as possible in the smallest, comma separated way possible. Be sure to describe people's moods, clothing, the environment, lighting, colors, and everything.",
  "image": "https://replicate.delivery/pbxt/LVcv7tbU1l1rYK9xLxnuujJbgI1R15R8MhXp8wJu0sXtLdDq/dac2c2cdfc32e462a9d869ce1f00454c.jpg"
}
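
Before sending a request, it can help to check the two required fields locally. The following is a minimal sketch, not part of the Cognitive Actions platform: the function name validate_inputs is hypothetical, and treating a "public URI" as an absolute http(s) URL is an assumption.

```python
from urllib.parse import urlparse


def validate_inputs(inputs):
    """Return a list of problems with the payload; an empty list means it looks valid."""
    problems = []

    # 'text' is required and must contain more than whitespace.
    text = inputs.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("'text' must be a non-empty string")

    # 'image' is required; this sketch assumes it must be an absolute http(s) URL.
    image = inputs.get("image")
    if not isinstance(image, str):
        problems.append("'image' must be a string URI")
    else:
        parsed = urlparse(image)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("'image' must be an absolute http(s) URI")

    return problems
```

This check only confirms the shape of the payload. It does not verify that the image is actually reachable by the service.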

Example Output

When you invoke this action with the appropriate input, it returns a detailed description of the image. Here’s an example of what you might receive:

The image captures the majestic Potala Palace in Lhasa, Tibet. The palace, a symbol of Tibetan culture and history, stands tall on a snowy mountain, its white walls contrasting sharply with the surrounding landscape. The palace is surrounded by a moat, adding to its grandeur. The sky above is overcast, casting a soft light over the scene. The palace is adorned with red and gold decorations, adding a touch of color to the otherwise monochromatic landscape. The people in the image are dressed in traditional Tibetan attire, their red robes standing out against the white backdrop. The environment is cold and snowy, with the snow covering the ground and the mountains in the background. The lighting is soft and diffused, creating a serene and peaceful atmosphere. The colors in the image are predominantly white, red, and gold, with the red and gold accents adding a touch of warmth to the otherwise cold and snowy scene.
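
A description like the one above is often easier to use once it is broken into individual statements, for example to display as a list or to derive short alternative text. The sketch below assumes the description arrives as a plain string; the helper names and the 125-character limit are illustrative choices, not anything defined by the action.

```python
import re


def split_statements(description):
    """Split a description into trimmed sentences, dropping empty pieces."""
    # Break after sentence-ending punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    return [p.strip() for p in parts if p.strip()]


def short_alt_text(description, max_chars=125):
    """Use the first sentence as alt text, cut at a word boundary if it is too long."""
    statements = split_statements(description)
    if not statements:
        return ""
    first = statements[0]
    if len(first) <= max_chars:
        return first
    # Drop the trailing partial word and any dangling punctuation, then mark the cut.
    return first[:max_chars].rsplit(" ", 1)[0].rstrip(",;:") + "..."
```

The sentence splitter is deliberately simple and will mis-split on abbreviations such as "Dr." or "e.g."; a dedicated sentence tokenizer would be more robust if that matters for your content.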

Conceptual Usage Example (Python)

Here’s a conceptual Python snippet demonstrating how to call the Cognitive Actions endpoint for generating a detailed image description:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "c2918833-cad6-4109-9484-7456e8db76e8"  # Action ID for Generate Detailed Image Description

# Construct the input payload based on the action's requirements
payload = {
    "text": "Caption this image. describe every single thing in the image in detail. Do not include any unnecessary words in your description for the sake of good grammar. I want many short statements that serve the single purpose of giving the most thorough description if items as possible in the smallest, comma separated way possible. Be sure to describe people's moods, clothing, the environment, lighting, colors, and everything.",
    "image": "https://replicate.delivery/pbxt/LVcv7tbU1l1rYK9xLxnuujJbgI1R15R8MhXp8wJu0sXtLdDq/dac2c2cdfc32e462a9d869ce1f00454c.jpg"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely; adjust to the service's typical latency
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; fall back to raw text
            print(f"Response body: {e.response.text}")

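In this code, replace COGNITIVE_ACTIONS_API_KEY and the endpoint URL with your actual values. The endpoint and the request body structure shown here are hypothetical, so confirm both against your platform's documentation. The action ID selects the Generate Detailed Image Description action, and the input payload carries the two required fields, text and image. On success the script prints the JSON response, which contains the description of the specified image.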

Conclusion

The InternLM-XComposer-2 Cognitive Action for generating detailed image descriptions is a powerful tool for developers looking to enhance their applications with advanced image analysis capabilities. By integrating this action, you can provide users with rich, contextual descriptions that improve the understanding of visual content. As you explore these capabilities, consider how they can be applied in various use cases, from content creation to accessibility enhancements. Start integrating today and elevate your application's user experience!