Enhance Your App with Vision Capabilities Using Florence-2 Cognitive Actions

In today’s fast-paced technological landscape, integrating advanced vision capabilities can significantly enhance user experiences in applications. The lucataco/florence-2-large API offers a powerful set of Cognitive Actions that utilize the Florence-2 model to perform various vision tasks such as captioning, object detection, and OCR (Optical Character Recognition). These pre-built actions not only streamline the integration process but also improve speed, quality, and accuracy for multi-task learning.
Prerequisites
Before diving into the integration of Cognitive Actions, you'll need to ensure you have the following:
- An API key for accessing the Cognitive Actions platform.
- Basic knowledge of making HTTP requests and working with JSON.
Authentication typically involves passing your API key in the headers of your requests to ensure secure access to the service.
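As a minimal sketch, the header setup might look like the following. Note that the Bearer-token scheme shown here is an assumption; consult the Cognitive Actions documentation for the exact header format your account requires.

```python
# Sketch of authenticated request headers; the Bearer scheme is an assumption,
# not confirmed by the Cognitive Actions docs.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",  # API key passed in the request headers
    "Content-Type": "application/json",    # request bodies are JSON
}
```

These headers are then attached to every request you send to the service.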
Cognitive Actions Overview
Perform Vision Task with Florence-2
The Perform Vision Task with Florence-2 action allows developers to leverage the capabilities of the Florence-2 model for a variety of vision tasks. This action supports different tasks, including captioning and object detection, by using a unified representation that excels in both zero-shot and fine-tuned scenarios.
Input
The input for this action requires the following fields:
- image (required): A URI pointing to the input image to be processed.
- taskInput (optional): Specifies the type of task to perform on the image. The available options include:
- Caption
- Detailed Caption
- More Detailed Caption
- Caption to Phrase Grounding
- Object Detection
- Dense Region Caption
- Region Proposal
- OCR
- OCR with Region
- textInput (optional): An additional text input to accompany the task if applicable.
Example Input:
{
  "image": "https://replicate.delivery/pbxt/L9zDhV2KiVnudUyRiNjt9P18LZ98Hrqq5GGdx9szmBCAyEhP/car.jpg",
  "taskInput": "Object Detection"
}
Output
The output from this action typically includes:
- img: A URI to the processed output image.
- text: A structured representation of detected objects, including bounding boxes and labels.
Example Output:
{
  "img": "https://assets.cognitiveactions.com/invocations/e97290f1-1a0d-44f0-bb18-501ad995368a/7d8a5387-e71a-46a1-b9d3-326c07e1d7e6.png",
  "text": "{'<OD>': {'bboxes': [[33.599998474121094, 160.55999755859375, 596.7999877929688, 371.7599792480469], [271.67999267578125, 242.1599884033203, 302.3999938964844, 246.95999145507812]], 'labels': ['car', 'door handle']}}"
}
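Note that in this example the text field is a Python-style dict string (single quotes) rather than strict JSON, so json.loads would fail on it. A small sketch of extracting the bounding boxes and labels, using ast.literal_eval and an abbreviated copy of the example value:

```python
import ast

# Abbreviated `text` value from the example output above; it is a Python-literal
# dict string (single quotes), so ast.literal_eval is used instead of json.loads.
text = ("{'<OD>': {'bboxes': [[33.599998474121094, 160.55999755859375, "
        "596.7999877929688, 371.7599792480469]], 'labels': ['car']}}")

detections = ast.literal_eval(text)["<OD>"]
for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
    # Each box is [x1, y1, x2, y2] in pixel coordinates with a matching label.
    print(f"{label}: ({x1:.1f}, {y1:.1f}) -> ({x2:.1f}, {y2:.1f})")
```

If the service instead returns strict JSON in your account, json.loads is the safer choice; inspect a real response before relying on either format.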
Conceptual Usage Example (Python)
Here’s a conceptual Python code snippet demonstrating how you might invoke this action:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "cf021c7b-9a8d-422d-a713-4d6b93fc5d93"  # Action ID for Perform Vision Task with Florence-2

# Construct the input payload based on the action's requirements
payload = {
    "image": "https://replicate.delivery/pbxt/L9zDhV2KiVnudUyRiNjt9P18LZ98Hrqq5GGdx9szmBCAyEhP/car.jpg",
    "taskInput": "Object Detection"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this example, replace the API key and endpoint with your actual credentials. The action ID and input payload are structured to match the requirements of the Perform Vision Task with Florence-2 action.
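The same request structure covers the other tasks listed above; only the payload changes. A few illustrative payloads, using the field names from this action's input schema (the textInput phrase below is a hypothetical example):

```python
# Payload variants for other supported tasks; only taskInput (and, where
# applicable, textInput) changes between requests.
IMAGE_URL = "https://replicate.delivery/pbxt/L9zDhV2KiVnudUyRiNjt9P18LZ98Hrqq5GGdx9szmBCAyEhP/car.jpg"

# Captioning needs no text input.
caption_payload = {"image": IMAGE_URL, "taskInput": "More Detailed Caption"}

# Caption to Phrase Grounding pairs the image with a phrase to locate.
grounding_payload = {
    "image": IMAGE_URL,
    "taskInput": "Caption to Phrase Grounding",
    "textInput": "a green car parked on the street",  # hypothetical phrase
}

# OCR with Region returns recognized text together with its location.
ocr_payload = {"image": IMAGE_URL, "taskInput": "OCR with Region"}
```

Each payload would be sent through the same requests.post call shown above.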
Conclusion
The lucataco/florence-2-large Cognitive Actions provide robust capabilities for integrating vision tasks into applications, enhancing user interaction through features like object detection and captioning. By leveraging these pre-built actions, developers can save time and resources while achieving high-quality results. As next steps, consider experimenting with different tasks or combining multiple actions to create a comprehensive solution tailored to your application needs.