Unlocking Speaker Insights: Integrate Diarization with eaa/diarisation Cognitive Actions

24 Apr 2025

In today's world of audio processing, understanding who is speaking and when can be crucial for many applications, from transcription services to meeting analysis. The eaa/diarisation Cognitive Actions offer a powerful solution for performing speaker diarization: identifying the individual speakers within an audio file and attributing segments of speech to each of them. By using these pre-built actions, developers can add sophisticated audio analysis to their applications without building the functionality from scratch.

Prerequisites

Before you start integrating the eaa/diarisation Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic understanding of JSON structure and Python programming.
  • A valid audio file in WAV format accessible via a URI.

To authenticate your requests, you will typically send your API key in the request headers. This ensures that your application can securely communicate with the Cognitive Actions service.
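As a minimal sketch of that authentication step (this assumes a Bearer-token scheme, which is hypothetical here; the exact header format depends on your platform configuration):

```python
# Hypothetical example: the API key is assumed to be sent as a
# Bearer token in the Authorization header of each request.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```

These headers are then passed along with every request to the service, as shown in the full usage example later in this article.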

Cognitive Actions Overview

Perform Speaker Diarization

The Perform Speaker Diarization action is designed to analyze audio files and differentiate between various speakers. It requires a URI pointing to the audio file and a JSON string that specifies the time segments for analysis.

Input

The input for this action consists of two required fields:

  • audio: A URI pointing to the input audio file in WAV format.
  • jsonRecords: A JSON string containing an array of records, each with start and duration attributes (in seconds) that define a segment of interest within the audio file.

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/K2w0v2lAkIdd96nYaESUJ7EAHQvQ51QhdAj8MJeNsxdmbM7p/sound_ac1_ar16K.wav",
  "jsonRecords": "[{\"start\":0.84,\"duration\":0.56},{\"start\":1.92,\"duration\":0.52},{\"start\":3.92,\"duration\":0.48},{\"start\":4.76,\"duration\":0.56},{\"start\":6.44,\"duration\":1.52},{\"start\":9.4,\"duration\":1.56},{\"start\":11.88,\"duration\":0.56},{\"start\":12.48,\"duration\":2.32},{\"start\":18.64,\"duration\":0.84},{\"start\":21.2,\"duration\":0.4},{\"start\":22.36,\"duration\":2.96},{\"start\":25.36,\"duration\":0.68},{\"start\":26.28,\"duration\":3.88},{\"start\":30.6,\"duration\":5.08},{\"start\":41.34,\"duration\":3.6},{\"start\":47.38,\"duration\":0.84},{\"start\":49.9,\"duration\":3.32}]"
}
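Because jsonRecords is a JSON string rather than a nested object, it is easiest to build it programmatically by serializing a Python list with json.dumps. A short sketch (the segment values below are illustrative, not tied to any particular audio file):

```python
import json

# Segments of interest: start time and duration, in seconds.
# Illustrative values; substitute the segments from your own audio.
segments = [
    {"start": 0.84, "duration": 0.56},
    {"start": 1.92, "duration": 0.52},
]

# The action expects jsonRecords as a JSON *string*, so the list
# is serialized with json.dumps rather than passed as-is.
json_records = json.dumps(segments)
print(json_records)
```

This avoids hand-writing escaped quotes and guarantees the string is valid JSON.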

Output

The output of this action is a nested array: each sub-array contains the indices of the input segments attributed to a single speaker. For example, an output might look like:

[
  [0],
  [1, 2, 3, 4, 5, 13, 6, 7, 10, 11, 12, 14, 16],
  [8],
  [9],
  [15]
]

Each number is an index into the jsonRecords array, so this output indicates which of the input segments belong to each of the five identified speakers.
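To turn that nested array back into speaker-labelled time ranges, you can pair each index with the segment it refers to. A small sketch (the output and segment values below are illustrative):

```python
# Example output from the action: each sub-array lists the indices of
# the input segments attributed to one speaker.
speaker_groups = [[0], [1, 2], [3]]

# Input segments in the same order they were sent in jsonRecords
# (illustrative values, in seconds).
segments = [
    {"start": 0.84, "duration": 0.56},
    {"start": 1.92, "duration": 0.52},
    {"start": 3.92, "duration": 0.48},
    {"start": 4.76, "duration": 0.56},
]

# Print each segment with its speaker label.
for speaker_id, indices in enumerate(speaker_groups):
    for i in indices:
        seg = segments[i]
        print(f"Speaker {speaker_id}: {seg['start']:.2f}s "
              f"for {seg['duration']:.2f}s")
```

This kind of post-processing is typically where diarization results are merged with transcription output.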

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet demonstrating how to call the Perform Speaker Diarization action.

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "422dc86d-7068-44fe-ac68-892ae33faa1a" # Action ID for Perform Speaker Diarization

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/K2w0v2lAkIdd96nYaESUJ7EAHQvQ51QhdAj8MJeNsxdmbM7p/sound_ac1_ar16K.wav",
    "jsonRecords": "[{\"start\":0.84,\"duration\":0.56},{\"start\":1.92,\"duration\":0.52},{\"start\":3.92,\"duration\":0.48},{\"start\":4.76,\"duration\":0.56},{\"start\":6.44,\"duration\":1.52},{\"start\":9.4,\"duration\":1.56},{\"start\":11.88,\"duration\":0.56},{\"start\":12.48,\"duration\":2.32},{\"start\":18.64,\"duration\":0.84},{\"start\":21.2,\"duration\":0.4},{\"start\":22.36,\"duration\":2.96},{\"start\":25.36,\"duration\":0.68},{\"start\":26.28,\"duration\":3.88},{\"start\":30.6,\"duration\":5.08},{\"start\":41.34,\"duration\":3.6},{\"start\":47.38,\"duration\":0.84},{\"start\":49.9,\"duration\":3.32}]"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload} # Hypothetical structure
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code:

  • Replace the COGNITIVE_ACTIONS_API_KEY and the endpoint with your actual API key and endpoint.
  • The action_id is set to the ID of the Perform Speaker Diarization action.
  • The payload is structured according to the input schema requirements.

Conclusion

The eaa/diarisation Cognitive Actions provide developers with a streamlined way to implement speaker diarization in their applications. By leveraging these actions, you can enhance your audio processing capabilities, making it easier to analyze and understand conversations. Whether you're building a transcription service or enhancing user interactions in applications, these tools offer flexibility and efficiency.

As you explore further, consider how integrating additional audio processing features could elevate your application even more!