Unlocking Image Processing Capabilities with Florence 2 Base Cognitive Actions

24 Apr 2025

In the world of AI-driven applications, leveraging the power of image processing can significantly enhance user experience and functionality. The Florence 2 Base Cognitive Actions provide developers with a suite of pre-built capabilities for various image-related tasks such as captioning, object detection, and optical character recognition (OCR). By integrating these actions into your applications, you can automate and streamline visual processing tasks, improving both speed and accuracy.

Prerequisites

Before you start using the Florence 2 Base Cognitive Actions, make sure you have the following in place:

  • API Key: You will need an API key to access the Cognitive Actions platform.
  • Setup: Familiarize yourself with how to pass this API key through the request headers for authentication.

Typically, authentication can be accomplished by including the API key in the headers of your requests.
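As a minimal sketch, the headers might be built like this (the Bearer scheme is an assumption; check your platform's documentation for the exact header name and format):

```python
# Placeholder key; replace with your real Cognitive Actions API key
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

# Assumed scheme: Bearer token in the Authorization header, JSON request body
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```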

Cognitive Actions Overview

Process Image with Florence 2 Base

The Process Image with Florence 2 Base action utilizes Microsoft's Florence 2 Base model to perform various image-related tasks. This includes captioning images, detecting objects, and executing OCR, among others. The model is designed for speed, quality, and accuracy, accepting images via a URL and executing the specified task.

  • Category: Image Processing

Input

The input for this action requires the following fields:

  • image (string, required): The URI of the input image. It must be a valid URL.
  • taskInput (string, optional): The desired visual processing task. This defaults to "Caption". Possible values include:
    • Caption
    • Detailed Caption
    • More Detailed Caption
    • Caption to Phrase Grounding
    • Object Detection
    • Dense Region Caption
    • Region Proposal
    • Referring Expression Segmentation
    • Region to Segmentation
    • Open Vocabulary Detection
    • Region to Category
    • Region to Description
    • OCR
    • OCR with Region
  • textInput (string, optional): Related text input for the task. For example, "hat".

Example Input:

{
  "image": "https://replicate.delivery/pbxt/MFjMxRSTlcth21sq6JnvWldj2v1ecm7S2PZpUZsn20PTM58L/download%20%2810%29.jpeg",
  "taskInput": "Referring Expression Segmentation",
  "textInput": "hat"
}
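Since `taskInput` only accepts the fixed set of values listed above, it can help to validate the payload before sending it. The `build_payload` helper below is hypothetical (not part of the platform's SDK), just a sketch of client-side validation against that list:

```python
from typing import Optional

# Allowed taskInput values, as listed in the input schema above
VALID_TASKS = {
    "Caption", "Detailed Caption", "More Detailed Caption",
    "Caption to Phrase Grounding", "Object Detection",
    "Dense Region Caption", "Region Proposal",
    "Referring Expression Segmentation", "Region to Segmentation",
    "Open Vocabulary Detection", "Region to Category",
    "Region to Description", "OCR", "OCR with Region",
}

def build_payload(image: str, task_input: str = "Caption",
                  text_input: Optional[str] = None) -> dict:
    """Build and sanity-check the input payload for the action (hypothetical helper)."""
    if not image.startswith(("http://", "https://")):
        raise ValueError("image must be a valid URL")
    if task_input not in VALID_TASKS:
        raise ValueError(f"Unknown taskInput: {task_input!r}")
    payload = {"image": image, "taskInput": task_input}
    if text_input is not None:
        payload["textInput"] = text_input  # optional, task-dependent
    return payload

payload = build_payload(
    "https://example.com/photo.jpeg",  # placeholder URL
    "Referring Expression Segmentation",
    "hat",
)
```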

Output

This action returns a JSON object that typically includes:

  • img: Processed image data, if the task produces any (null in this example).
  • text: A string containing the results of the processing task, such as polygons and labels for segmentation tasks.

Example Output:

{
  "img": null,
  "text": "{'<REFERRING_EXPRESSION_SEGMENTATION>': {'polygons': [[[1430.528076171875, 770.54248046875, ...]]], 'labels': ['']}}"
}
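Note that in the example above, the `text` field is a string whose contents use Python-style single quotes rather than strict JSON, so `json.loads` would fail on it. Assuming the field really is Python-repr formatted (as shown), `ast.literal_eval` can parse it safely:

```python
import ast

# The "text" field from the example output: a string containing a
# Python-repr-style dict (single quotes), not strict JSON
raw_text = "{'<REFERRING_EXPRESSION_SEGMENTATION>': {'polygons': [[[1430.5, 770.5]]], 'labels': ['']}}"

# ast.literal_eval parses Python literals without executing arbitrary code
parsed = ast.literal_eval(raw_text)
segmentation = parsed["<REFERRING_EXPRESSION_SEGMENTATION>"]
first_polygon = segmentation["polygons"][0][0]  # flat list of coordinates
```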

Conceptual Usage Example (Python)

Here’s a conceptual Python snippet demonstrating how you might invoke the Process Image with Florence 2 Base action through a generic Cognitive Actions endpoint:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "609952da-75d2-4da8-bcc1-682354dfebd7"  # Action ID for Process Image with Florence 2 Base

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/MFjMxRSTlcth21sq6JnvWldj2v1ecm7S2PZpUZsn20PTM58L/download%20%2810%29.jpeg",
    "taskInput": "Referring Expression Segmentation",
    "textInput": "hat"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30,  # Avoid hanging indefinitely on a slow endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # response body was not valid JSON
            print(f"Response body: {e.response.text}")

In this code snippet, replace the placeholder API key and URL with your actual values. The payload is structured according to the required input schema, and the request is sent to the hypothetical endpoint.

Conclusion

The Florence 2 Base Cognitive Actions provide a powerful toolset for developers looking to enhance their applications with image processing capabilities. By integrating these actions, you can automate complex visual tasks, making your applications more intelligent and responsive. Explore additional use cases such as real-time object detection or advanced OCR to fully leverage the potential of these Cognitive Actions in your projects.