Streamline Audio Analysis with Speaker Diarization Actions for CollectiveAI

23 Apr 2025
Integrating voice recognition capabilities into applications has become increasingly vital as audio content becomes more prevalent. The CollectiveAI Speaker Diarization API provides powerful Cognitive Actions that allow developers to analyze audio files and identify distinct speakers. By using these pre-built actions, developers can enhance their applications with features like speaker identification, segmentation, and improved audio processing, all while saving time and resources on building complex algorithms from scratch.

Prerequisites

Before you can start using the Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests. This key will typically be passed in the headers of your HTTP requests.
  • A valid audio file URL to test the Speaker Diarization action.

Cognitive Actions Overview

Perform Speaker Diarization

The Perform Speaker Diarization action analyzes an audio file to identify and label distinct speakers within the recording. You can specify a maximum, minimum, or exact number of speakers to guide the diarization and improve its precision.

Input

The input schema for this action requires the following fields:

  • audio (string, required): The URI of the audio file to process. This must be a valid URL pointing to an audio resource.
  • maximumSpeakers (integer, optional): Specifies the maximum number of distinct speakers allowed in the audio during diarization. This value must be at least 1.
  • minimumSpeakers (integer, optional): Specifies the minimum number of distinct speakers expected in the audio during diarization. This value must be at least 1.
  • numberOfSpeakers (integer, optional): The exact number of speakers, if known. If omitted, the system infers the speaker count (the default value is 'infer').

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3"
}
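If you have prior knowledge about how many people are in the recording, you can pass the optional constraint fields alongside the audio URL. Here is a minimal sketch of building such a payload; the field names come from the input schema above, while the helper function, its validation logic, and the chosen bounds are illustrative, not part of the API:

```python
def build_diarization_payload(audio_url, minimum=None, maximum=None):
    # Validate the constraints described in the input schema: both bounds
    # must be at least 1, and the minimum cannot exceed the maximum.
    if minimum is not None and minimum < 1:
        raise ValueError("minimumSpeakers must be at least 1")
    if maximum is not None and maximum < 1:
        raise ValueError("maximumSpeakers must be at least 1")
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("minimumSpeakers cannot exceed maximumSpeakers")

    payload = {"audio": audio_url}
    if minimum is not None:
        payload["minimumSpeakers"] = minimum
    if maximum is not None:
        payload["maximumSpeakers"] = maximum
    return payload

# Example: expect between 2 and 4 speakers in the sample recording.
payload = build_diarization_payload(
    "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3",
    minimum=2,
    maximum=4,
)
```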

Output

The output from the speaker diarization action includes the following:

  • segments: An array of segments with timestamps marking when each speaker is speaking.
  • speakers: An object containing:
    • count: The total number of speakers identified.
    • labels: An array of labels assigned to speakers.
    • embeddings: A mapping of speaker labels to their respective embeddings.

Example Output:

{
  "segments": [
    {
      "stop": "0:00:06.629881",
      "start": "0:00:00.008489",
      "speaker": "A"
    },
    {
      "stop": "0:00:22.555178",
      "start": "0:00:22.300509",
      "speaker": "B"
    }
  ],
  "speakers": {
    "count": 2,
    "labels": [
      "A",
      "B"
    ],
    "embeddings": {
      "A": [...],
      "B": []
    }
  }
}
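The timestamps in segments are strings of the form H:MM:SS.ffffff, so a common first step is converting them to seconds and aggregating speaking time per label. A small sketch over the example output above; the helper names are my own, not part of the API:

```python
def timestamp_to_seconds(ts):
    # "0:00:06.629881" -> 6.629881
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def speaking_time(segments):
    # Sum (stop - start) for each speaker label.
    totals = {}
    for seg in segments:
        duration = timestamp_to_seconds(seg["stop"]) - timestamp_to_seconds(seg["start"])
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + duration
    return totals

# The two segments from the example output above.
segments = [
    {"stop": "0:00:06.629881", "start": "0:00:00.008489", "speaker": "A"},
    {"stop": "0:00:22.555178", "start": "0:00:22.300509", "speaker": "B"},
]
totals = speaking_time(segments)
```

Speaker A talks for roughly 6.6 seconds in these two segments, speaker B for about a quarter of a second.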

Conceptual Usage Example (Python)

Here’s a conceptual Python code snippet that shows how to call the Perform Speaker Diarization action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "19879184-98d4-4264-8a8e-a079b645b65b"  # Action ID for Perform Speaker Diarization

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely on a slow or stalled response
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action_id corresponds to the Perform Speaker Diarization action, and the input payload follows the required fields described above. The error handling prints the response status and body when a request fails, which makes debugging authentication or validation problems easier.
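Once you have a result, a common post-processing step is merging consecutive segments from the same speaker when the gap between them is small, which produces cleaner speaker turns for display or transcript alignment. A sketch under my own assumptions: the gap threshold and function name are illustrative, and times are assumed to already be converted to seconds:

```python
def merge_turns(segments, max_gap=0.5):
    # segments: list of {"speaker": str, "start": float, "stop": float},
    # sorted by start time, with times in seconds.
    merged = []
    for seg in segments:
        if (merged
                and merged[-1]["speaker"] == seg["speaker"]
                and seg["start"] - merged[-1]["stop"] <= max_gap):
            # Same speaker with only a short pause: extend the previous turn.
            merged[-1]["stop"] = seg["stop"]
        else:
            merged.append(dict(seg))
    return merged

# Two short utterances by A separated by a 0.2 s pause collapse into one turn.
turns = merge_turns([
    {"speaker": "A", "start": 0.0, "stop": 5.0},
    {"speaker": "A", "start": 5.2, "stop": 8.0},
    {"speaker": "B", "start": 8.1, "stop": 10.0},
])
```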

Conclusion

The CollectiveAI Speaker Diarization Cognitive Actions give developers powerful tools for analyzing audio data effectively. By leveraging these actions, you can add features that recognize and label speakers, making your audio content more accessible and usable. Explore integrating this action into your projects and consider how speaker-aware audio analysis can improve your users' experience.