Enhance Audio Understanding with Speaker Diarization Using Cognitive Actions

25 Apr 2025

In today’s digital landscape, audio analysis is becoming increasingly vital for applications ranging from customer service to media content creation. The Cognitive Actions from the collectiveai-team let developers leverage advanced audio-processing capabilities with minimal effort. This post delves into the Speaker Diarization action, which distinguishes between different speakers in an audio file, enhancing audio understanding and information extraction.

Prerequisites

Before you start integrating the Cognitive Actions into your application, ensure you have the following:

  • An API key for the Cognitive Actions platform, which you'll need for authentication.
  • Basic knowledge of making HTTP requests and handling JSON data.

Authentication typically involves passing your API key in the request headers for secure access to the action endpoints.
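As a quick illustration, the headers for such a request might be constructed as follows. This is a sketch, not the platform's documented scheme: the Bearer-token convention is an assumption, and the key shown is a placeholder.

```python
# Placeholder key; substitute your actual Cognitive Actions API key.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

# Assumed convention: the API key is sent as a Bearer token,
# and payloads are JSON-encoded.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```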

Cognitive Actions Overview

Perform Speaker Diarization

The Perform Speaker Diarization action analyzes audio files to differentiate between various speakers, allowing for a better understanding of conversations, meetings, or any audio content involving multiple speakers.

Input

The input for this action requires a JSON object with a single property, audio, which is a URI pointing to the audio file you want to analyze.

Input Schema:

{
  "audio": "string"
}

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3"
}

Output

The output of this action is a JSON response that provides segments of the audio, indicating the start and stop times for each speaker along with their identifiers.

Example Output:

{
  "segments": [
    {
      "stop": "0:00:09.779063",
      "start": "0:00:00.497812",
      "speaker": "A"
    },
    {
      "stop": "0:00:02.168438",
      "start": "0:00:02.033437",
      "speaker": "B"
    },
    {
      "stop": "0:03:34.962188",
      "start": "0:00:09.863438",
      "speaker": "B"
    }
  ],
  "speakers": {
    "count": 2,
    "labels": [
      "A",
      "B"
    ],
    "embeddings": {
      "A": [ ... ],
      "B": [ ... ]
    }
  }
}

In this output, the segments array provides detailed time frames for each speaker, while the speakers object summarizes the total number of speakers and their unique identifiers.
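These timestamp strings can be parsed into seconds to derive summary statistics, such as total speaking time per speaker. Here is a minimal sketch, assuming the `H:MM:SS.ffffff` format shown in the example output:

```python
from collections import defaultdict

def to_seconds(ts: str) -> float:
    """Parse an 'H:MM:SS.ffffff' timestamp string into seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def talk_time(segments: list[dict]) -> dict[str, float]:
    """Total speaking time per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += to_seconds(seg["stop"]) - to_seconds(seg["start"])
    return dict(totals)

# The segments from the example output above:
segments = [
    {"stop": "0:00:09.779063", "start": "0:00:00.497812", "speaker": "A"},
    {"stop": "0:00:02.168438", "start": "0:00:02.033437", "speaker": "B"},
    {"stop": "0:03:34.962188", "start": "0:00:09.863438", "speaker": "B"},
]
print(talk_time(segments))
```

Applied to the example output, this shows speaker B dominating the conversation, which is consistent with the long third segment.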

Conceptual Usage Example (Python)

Here’s a conceptual example of how you might call the Perform Speaker Diarization action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "60eded05-6cb6-493d-a624-5e2a64f51201"  # Action ID for Perform Speaker Diarization

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60  # Avoid hanging indefinitely on a slow or unresponsive endpoint
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers json.JSONDecodeError and requests' own variant
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key.
  • The action_id is set to the ID for the Perform Speaker Diarization action.
  • The payload is constructed to match the required input schema, and the request is made to a hypothetical endpoint.

Conclusion

Integrating the Speaker Diarization Cognitive Action into your application can significantly enhance your audio processing capabilities, enabling better understanding and analysis of conversations. With just a few lines of code, you can extract meaningful insights from audio files, paving the way for innovative applications in various domains.

As a next step, consider exploring more complex use cases involving transcription or sentiment analysis alongside speaker diarization for a comprehensive audio analysis solution.
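One common way to combine the two is to label each transcript segment with the speaker whose diarization segment overlaps it the most. The sketch below is a generic heuristic, not part of the Cognitive Actions API; it assumes both the transcript and diarization timestamps have already been converted to seconds, and the transcript data is hypothetical.

```python
def assign_speakers(transcript_segments: list[dict],
                    diarization_segments: list[dict]) -> list[dict]:
    """Label each transcript segment with the speaker whose diarization
    segment overlaps it the most (a simple maximum-overlap heuristic)."""
    labeled = []
    for t in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for d in diarization_segments:
            # Overlap between the two time intervals, in seconds
            overlap = min(t["end"], d["end"]) - max(t["start"], d["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = d["speaker"], overlap
        labeled.append({**t, "speaker": best_speaker})
    return labeled

# Hypothetical transcript segments (times in seconds)
transcript = [
    {"start": 0.5, "end": 9.0, "text": "Welcome to the show."},
    {"start": 10.0, "end": 20.0, "text": "Thanks for having me."},
]
# Diarization segments from the example output, converted to seconds
diarization = [
    {"start": 0.497812, "end": 9.779063, "speaker": "A"},
    {"start": 9.863438, "end": 214.962188, "speaker": "B"},
]
print(assign_speakers(transcript, diarization))
```

A word-level alignment (matching per-word timestamps instead of whole segments) gives finer results near speaker boundaries, at the cost of a transcription service that exposes word timings.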