Unlocking Image and Text Similarity with CLIP Features Cognitive Actions

25 Apr 2025

In machine learning, the ability to analyze and compare text and images is crucial for numerous applications, from content classification to enhanced user experiences. The andreasjansson/clip-features API provides a powerful toolset through its Generate CLIP Features action, which uses the CLIP model (clip-vit-large-patch14) for feature extraction. This lets developers implement functionality such as similarity checking between textual descriptions and corresponding images with minimal effort.

Prerequisites

Before diving into the integration of the Cognitive Actions, ensure that you have the following prerequisites:

  • Access to the Cognitive Actions platform with an API key.
  • A basic understanding of RESTful API calls and JSON structure.
  • The Python requests library installed, to facilitate API requests.

Authentication typically involves passing your API key in the request headers, ensuring that your application has the necessary permissions to access the actions.

Cognitive Actions Overview

Generate CLIP Features

This action generates features using the CLIP model, enabling the analysis and comparison of text and image inputs. It is particularly useful for applications that require similarity checks between descriptions and visual content.

  • Category: Image Analysis

Input

The action requires a structured input consisting of newline-separated items, which can include either text descriptions or image URIs.

Input Schema:

{
  "inputItems": "string" // Newline-separated list of inputs
}

Example Input:

{
  "inputItems": "a photo of a dog\na cat\ntwo cats with remote controls\nhttps://replicate.com/api/models/cjwbw/clip-vit-large-patch14/files/36b04aec-efe2-4dea-9c9d-a5faca68b2b2/000000039769.jpg"
}
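Rather than hand-writing the newline-separated string, you can build it from a Python list. The sketch below assumes that items beginning with http(s):// are treated as image URIs and everything else as text, which matches the example input above:

```python
# Build the newline-separated "inputItems" string from a Python list.
# Text prompts and image URLs can be mixed; URLs are assumed to be
# interpreted as image inputs by the action.
items = [
    "a photo of a dog",
    "a cat",
    "two cats with remote controls",
    "https://replicate.com/api/models/cjwbw/clip-vit-large-patch14/files/36b04aec-efe2-4dea-9c9d-a5faca68b2b2/000000039769.jpg",
]

payload = {"inputItems": "\n".join(items)}

# The payload contains one string with all four items.
print(payload["inputItems"].count("\n") + 1)  # 4
```

Joining with "\n" keeps the payload a single string, as the input schema requires.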

Output

The action returns an array of embeddings corresponding to each input item. Each embedding is a numerical representation that facilitates similarity comparisons.

Example Output:

[
  {
    "input": "a photo of a dog",
    "embedding": [0.16265869140625, -0.06995393335819244, ...]
  },
  {
    "input": "a cat",
    "embedding": [0.08319822698831558, -0.027810707688331604, ...]
  },
  ...
]

Conceptual Usage Example (Python)

Here's how you might call the Generate CLIP Features action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "3a840cf0-a0c6-4fd8-903c-c3f62c9ab541"  # Action ID for Generate CLIP Features

# Construct the input payload based on the action's requirements
payload = {
    "inputItems": "a photo of a dog\na cat\ntwo cats with remote controls\nhttps://replicate.com/api/models/cjwbw/clip-vit-large-patch14/files/36b04aec-efe2-4dea-9c9d-a5faca68b2b2/000000039769.jpg"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # body was not valid JSON
            print(f"Response body: {e.response.text}")

In this example, you will need to replace the placeholder for the API key and the endpoint with your actual values. The action ID and the input payload must be structured correctly to ensure successful execution.
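Once the call succeeds, a common next step is to find which text item best matches the image item. The sketch below assumes the response mirrors the example output shown earlier (a list of objects with "input" and "embedding" keys); the embeddings here are truncated toy values, not real CLIP output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Mocked result in the shape of the example output (toy 2-D embeddings).
result = [
    {"input": "a photo of a dog", "embedding": [0.9, 0.1]},
    {"input": "a cat", "embedding": [0.2, 0.95]},
    {"input": "https://example.com/cat.jpg", "embedding": [0.25, 0.9]},
]

image = result[-1]        # the image URI was the last input item
texts = result[:-1]       # the text items

best = max(texts, key=lambda r: cosine(r["embedding"], image["embedding"]))
print(best["input"])  # a cat
```

With real embeddings the pattern is identical: separate the image entries from the text entries, then rank by cosine similarity.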

Conclusion

The Generate CLIP Features action from the andreasjansson/clip-features API empowers developers to easily analyze and compare text and image inputs, facilitating enhanced user experiences and functionalities in various applications. By integrating this action, you can unlock the potential for advanced image and text similarity checks within your app.

As a next step, consider exploring further use cases such as content-based image retrieval or automated tagging systems to fully leverage the capabilities of the CLIP model.