Enhance Your Applications with Image-to-Text Captioning Using ControlNet Actions

22 Apr 2025

Automatically generating captions for images is useful in many applications. The zylim0702/controlnet-v1-1-multi API gives developers a Cognitive Action for this: Generate Image-to-Text Caption with ControlNet. The action combines image adaptation, upscale augmentation, and enhanced semantic interpretation, so you can add automatic image-to-text captioning to your applications.

Prerequisites

Before diving into the integration process, ensure that you have the following:

  • API Key: An API key for accessing the Cognitive Actions platform.
  • Setup: Familiarity with making HTTP requests and handling JSON data.

Authentication generally involves passing your API key in the request headers, allowing you to securely access the Cognitive Actions.
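
As a minimal sketch of the header-based authentication described above, the helper below builds the request headers from an API key. The Bearer scheme matches the usage example later in this article; the environment variable name COGNITIVE_ACTIONS_API_KEY and the function name build_headers are assumptions made for illustration.

```python
import os


def build_headers(api_key=None):
    """Return request headers that carry the API key as a Bearer token.

    The environment variable name is an assumption, not a platform requirement.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("No API key provided; set COGNITIVE_ACTIONS_API_KEY")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly is convenient in tests.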

Cognitive Actions Overview

Generate Image-to-Text Caption with ControlNet

This action utilizes the ControlNet AI model to automate the process of generating captions from images. It integrates sophisticated features such as image adaptation and upscale augmentation, enhancing the model's ability to interpret visual content semantically.

Input

The input required for this action is encapsulated in a JSON object, which includes several parameters. The essential field is the image, which must be a valid URI. Here’s a breakdown of the input schema:

  • Required:
    • image (string): URI of the input image.
  • Optional:
    • eta (number): Noise level in the denoising process (default: 0).
    • seed (integer): Seed for random number generation; setting it makes results reproducible.
    • scale (number): Classifier-free guidance scale (default: 9, range: 0.1 to 30).
    • prompt (string): Guides the model's output (e.g., "a dog in a bright sunshine jungle, hard lighting").
    • strength (number): Influence of the input image (default: 1).
    • structure (string): Structure type to condition on (default: "canny").
    • lowThreshold (integer): Low threshold for 'canny' structure (default: 100).
    • highThreshold (integer): High threshold for 'canny' structure (default: 200).
    • imageUpscaler (boolean): Enable image upscaling (default: false).
    • diffusionSteps (integer): Number of diffusion steps (default: 20).
    • negativePrompt (string): Aspects to avoid in the output (the schema supplies a default).
    • numberOfSamples (integer): Number of output samples (default: 1, options: 1 or 4).
    • additionalPrompt (string): Additional details for the prompt (the schema supplies a default).
    • autogeneratedPrompt (boolean): Indicates if the prompt should be auto-generated (default: false).
    • preprocessorResolution (integer): Resolution for preprocessing (default: 512).

Example Input

{
  "image": "https://replicate.delivery/pbxt/JREI44b9KCW78ynS9sH9je7wCckmEHcSF3EXwJBlhDhbh0jH/dog.png",
  "scale": 9,
  "prompt": "a dog in a bright sunshine jungle, hard lighting",
  "strength": 1,
  "structure": "canny",
  "lowThreshold": 100,
  "highThreshold": 200,
  "imageUpscaler": false,
  "diffusionSteps": 20,
  "negativePrompt": "Longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality",
  "numberOfSamples": 1,
  "additionalPrompt": "Best quality, extremely detailed",
  "autogeneratedPrompt": true,
  "preprocessorResolution": 512
}
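
Before sending a request, it can help to check a payload against the constraints listed in the schema. The following is a sketch under stated assumptions: build_payload is a hypothetical helper, it only checks the constraints the schema above spells out (the required image URI, the scale range, and the allowed numberOfSamples values), and the placeholder URL is illustrative. The service itself may enforce more or different rules.

```python
from urllib.parse import urlparse

# Optional field names as listed in the input schema above.
OPTIONAL_FIELDS = {
    "eta", "seed", "scale", "prompt", "strength", "structure",
    "lowThreshold", "highThreshold", "imageUpscaler", "diffusionSteps",
    "negativePrompt", "numberOfSamples", "additionalPrompt",
    "autogeneratedPrompt", "preprocessorResolution",
}


def build_payload(image, **options):
    """Build an input payload after a few client-side sanity checks."""
    # The schema requires 'image' to be a URI; here we only check for a scheme.
    if not isinstance(image, str) or not urlparse(image).scheme:
        raise ValueError("image must be a URI string, e.g. an https:// URL")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    scale = options.get("scale")
    if scale is not None and not 0.1 <= scale <= 30:
        raise ValueError("scale must be between 0.1 and 30")
    samples = options.get("numberOfSamples")
    if samples is not None and samples not in (1, 4):
        raise ValueError("numberOfSamples must be 1 or 4")
    return {"image": image, **options}


# Example: only the fields you pass are sent; omitted ones fall back to defaults.
payload = build_payload("https://example.com/dog.png", scale=9, structure="canny")
```

Catching a misspelled field name or an out-of-range value locally saves a round trip to the API.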

Output

Upon successful execution, the action returns an array of URLs, one per generated result. Note that the URL in the example below points to a .png file rather than to inline caption text:

[
  "https://assets.cognitiveactions.com/invocations/1df00ea6-65eb-464f-a2a0-cca04516dc04/20ebd8db-1c06-4602-942e-684756911528.png"
]
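
Since the output is a list of URLs, a typical next step is to fetch each file. The sketch below uses only the Python standard library; the function names filename_from_url and download_outputs are hypothetical, and it assumes the returned URLs can be downloaded without extra authentication, which the article does not confirm.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download_outputs(urls, dest_dir="."):
    """Download each returned URL into dest_dir and return the saved paths."""
    saved = []
    for url in urls:
        path = os.path.join(dest_dir, filename_from_url(url))
        # Assumes the URL is publicly readable; add headers if it is not.
        with urlopen(url, timeout=60) as resp, open(path, "wb") as fh:
            fh.write(resp.read())
        saved.append(path)
    return saved
```

Calling download_outputs(result_urls, "outputs") would save one file per returned URL, provided the outputs directory already exists.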

Conceptual Usage Example (Python)

Here’s how you might call the Cognitive Actions execution endpoint in Python to generate an image caption:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "27665a0a-52ce-40ab-bb90-e839dd3bf562" # Action ID for Generate Image-to-Text Caption with ControlNet

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/JREI44b9KCW78ynS9sH9je7wCckmEHcSF3EXwJBlhDhbh0jH/dog.png",
    "scale": 9,
    "prompt": "a dog in a bright sunshine jungle, hard lighting",
    "strength": 1,
    "structure": "canny",
    "lowThreshold": 100,
    "highThreshold": 200,
    "imageUpscaler": False,  # Python booleans are capitalized (JSON uses false/true)
    "diffusionSteps": 20,
    "negativePrompt": "Longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality",
    "numberOfSamples": 1,
    "additionalPrompt": "Best quality, extremely detailed",
    "autogeneratedPrompt": True,
    "preprocessorResolution": 512
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors are ValueError subclasses
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace the API key and endpoint with your actual values.
  • The action_id is set to the specific action you want to execute.
  • The payload is constructed according to the input schema, ensuring all required fields are included.

Conclusion

Integrating the Generate Image-to-Text Caption with ControlNet action into your applications can significantly enhance user experience by providing automatic, context-aware captions for images. By utilizing this powerful Cognitive Action, developers can streamline workflows, improve accessibility, and enrich content. Consider exploring further use cases or experimenting with different input parameters to maximize the potential of this API. Happy coding!