Enhance Your App with Visual-Controlled Text Generation Using yxuansu/magic Actions

21 Apr 2025

Integrating visual elements into text generation can elevate your application’s interactivity and engagement. The yxuansu/magic API provides a framework for multimodal tasks, combining the language modeling of GPT-2 with the image understanding of CLIP. This set of Cognitive Actions lets you generate contextually relevant text from visual inputs, powering features such as image captioning and story generation without additional model training. In this article, we’ll walk through how to implement one of the key actions available in this API.

Prerequisites

Before diving into the implementation, make sure you have:

  • An API key for the yxuansu/magic platform.
  • A basic understanding of JSON and RESTful API calls.
  • Python installed on your machine with the requests library.

Authentication typically involves passing your API key in the headers of your requests to access the Cognitive Actions.
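As a minimal sketch, the headers might look like the following. The Bearer scheme here is an assumption based on common REST API conventions; check your platform's documentation for the exact format.

```python
# Minimal sketch of the request headers used throughout this article.
# The Bearer scheme is an assumption; your platform may use a different one.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```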

Cognitive Actions Overview

Enable Visual-Controlled Text Generation

This action empowers developers to integrate visual controls into text generation tasks. By leveraging the synergy between image inputs and language models, users can create descriptive captions for images or generate stories based on visual prompts, enhancing the user experience significantly.

Category: Text Generation

Input

The input for this action requires the following fields:

  • imageUri (required): A valid URI pointing to the input image.
  • storyTitle (optional): The title used for the generated story when selectedTask is "Story Generation".
  • selectedTask (optional): Specifies the task to perform. It can be "Image Captioning" or "Story Generation". By default, it is set to "Image Captioning".

Example Input:

{
  "imageUri": "https://replicate.delivery/mgxm/35687157-a515-449d-979e-109c2f0c6149/COCO_val2014_000000516750.jpg",
  "selectedTask": "Image Captioning"
}
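The schema above can be captured in a small helper. This is a hypothetical convenience function, not part of the API itself; the field names come from the schema, while the validation rules (e.g. requiring storyTitle for story generation) are assumptions.

```python
# Hypothetical helper that builds an input payload for this action.
# Field names follow the documented schema; validation rules are assumptions.
def build_payload(image_uri, selected_task="Image Captioning", story_title=None):
    if selected_task not in ("Image Captioning", "Story Generation"):
        raise ValueError(f"Unsupported task: {selected_task}")
    payload = {"imageUri": image_uri, "selectedTask": selected_task}
    if selected_task == "Story Generation":
        if story_title is None:
            raise ValueError("storyTitle is required for Story Generation")
        payload["storyTitle"] = story_title
    return payload
```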

Output

Upon successful execution, the action returns an output containing the generated text based on the input image:

  • image_caption: The generated caption for the image.
  • magic_search_result: (Optional) Any search results from the MAGIC framework, typically null.
  • contrastive_search_result: (Optional) Any search results from contrastive methods, typically null.

Example Output:

{
  "image_caption": "A yellow boat is lined up on the beach.",
  "magic_search_result": null,
  "contrastive_search_result": null
}
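When consuming the response, it is worth guarding against a missing or empty caption. A small sketch (the accessor function is hypothetical; only the field names come from the output schema above):

```python
# Sketch of extracting the caption from the action's output. For the
# captioning task, only image_caption is expected to be populated; the
# *_search_result fields are typically null.
def extract_caption(result):
    caption = result.get("image_caption")
    if not caption:
        raise KeyError("No image_caption in response")
    return caption
```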

Conceptual Usage Example (Python)

Here’s how you can use the Enable Visual-Controlled Text Generation action in your application:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "6673c0c3-d681-4ded-9823-6d4b3a560f93"  # Action ID for Enable Visual-Controlled Text Generation

# Construct the input payload based on the action's requirements
payload = {
    "imageUri": "https://replicate.delivery/mgxm/35687157-a515-449d-979e-109c2f0c6149/COCO_val2014_000000516750.jpg",
    "selectedTask": "Image Captioning"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30  # Avoid hanging indefinitely on a slow or unreachable endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers json.JSONDecodeError across requests versions
            print(f"Response body: {e.response.text}")

In this code snippet, we define the action_id for the Enable Visual-Controlled Text Generation action and construct the input payload according to the required schema. The request is sent to the hypothetical endpoint, and the response is handled gracefully, ensuring that you can debug any issues effectively.
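Switching the same request to story generation is only a change of payload; the surrounding request and error-handling code stays identical. A sketch, with a purely illustrative storyTitle value:

```python
# Story-generation variant of the input payload. Send it with the same
# request code shown above; the storyTitle value here is illustrative.
story_payload = {
    "imageUri": "https://replicate.delivery/mgxm/35687157-a515-449d-979e-109c2f0c6149/COCO_val2014_000000516750.jpg",
    "selectedTask": "Story Generation",
    "storyTitle": "A Day at the Harbor"
}
```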

Conclusion

The yxuansu/magic Cognitive Action for visual-controlled text generation opens up exciting possibilities for developers looking to integrate advanced multimodal capabilities into their applications. By leveraging this action, you can create more engaging and context-aware experiences for your users. Consider exploring other use cases such as story generation to further enhance your application's interactivity. Happy coding!