Effortlessly Segment Audio by Speaker with lucataco/speaker-diarization Actions

22 Apr 2025

In audio processing, working out who speaks when in a recording is tedious to do by hand. The lucataco/speaker-diarization Cognitive Actions are designed to streamline this speaker diarization process. The primary action, Segment Audio by Speaker, runs on GPU hardware to split an audio file into segments and attribute each segment to a speaker. This guide shows how to integrate the action into your applications, with example input, example output, and a conceptual request in Python.

Prerequisites

Before diving into the implementation, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • A basic understanding of how to make HTTP requests in your programming language of choice.

For authentication, you will typically need to pass your API key in the headers of your requests.
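As a minimal sketch of that step, the helper below reads the key from an environment variable instead of hard-coding it and builds the request headers. The variable name COGNITIVE_ACTIONS_API_KEY and the function name build_auth_headers are assumptions made for illustration; the Bearer scheme mirrors the conceptual example later in this guide and should be checked against the platform's own documentation.

```python
import os


def build_auth_headers(env_var="COGNITIVE_ACTIONS_API_KEY"):
    """Build request headers from an API key stored in the environment.

    The environment variable name is a hypothetical choice, not a
    platform requirement.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",  # assumed Bearer scheme
        "Content-Type": "application/json",
    }
```

Keeping the key out of source code makes it harder to commit it to version control by accident.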

Cognitive Actions Overview

Segment Audio by Speaker

Description:
The Segment Audio by Speaker action splits an audio recording into segments according to who is speaking. It runs on an A100 GPU and reports which speaker is active at each point in the audio file.

Category:
Speech Diarization

Input

The input for this action consists of a single required field:

  • audioFileUri (string): The URI of the audio file to be processed. Ensure that the URI is accessible and points to a valid audio format (e.g., MP3).

Example Input:

{
  "audioFileUri": "https://replicate.delivery/pbxt/JDG2I8TL3x7gAzn5rAswOcO8lpfQIUXnehBeUUiP59ZaL8oc/lex-levin-4min.mp3"
}
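Before sending a request, it can help to catch obviously bad input locally. The sketch below checks that the URI is an HTTP(S) URL whose path ends in an audio extension, then returns the input object. The helper name build_input and the extension list are assumptions; the source only confirms MP3 as an example format, and the check does not verify that the file is actually reachable.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list for illustration; only MP3 is mentioned by the action's docs.
ASSUMED_AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def build_input(audio_file_uri):
    """Validate the URI shape and return the action's input object."""
    parsed = urlparse(audio_file_uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an HTTP(S) URL: {audio_file_uri!r}")
    suffix = PurePosixPath(parsed.path).suffix.lower()
    if suffix not in ASSUMED_AUDIO_EXTENSIONS:
        raise ValueError(f"Unexpected file extension: {suffix!r}")
    return {"audioFileUri": audio_file_uri}
```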

Output

Upon successful execution, the action returns a URL pointing to a JSON file that contains the diarization results. This JSON typically includes the timestamps and speaker identifiers for each segment.

Example Output:

https://assets.cognitiveactions.com/invocations/0b832ed7-0b2a-42c3-bc44-882755911e62/6086481d-d26e-4eed-b4bd-d6eb505949e5.json
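Once you have the output URL, you would usually download the JSON and summarize it. The sketch below is one way to do that under stated assumptions: the JSON is taken to have a top-level "segments" list whose items carry "speaker", "start", and "stop" fields, with times given either as numeric seconds or as "H:MM:SS.ffffff" strings. That schema is not confirmed by this guide, so inspect a real output file and adjust the field names before relying on it. All function names here are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_diarization(url, timeout=30):
    """Download and parse the JSON file the action's output URL points to."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def to_seconds(value):
    """Accept numeric seconds or an "H:MM:SS.ffffff" style string."""
    if isinstance(value, (int, float)):
        return float(value)
    seconds = 0.0
    for part in str(value).split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def speaking_time_by_speaker(diarization):
    """Sum segment durations per speaker (assumed schema, see lead-in)."""
    totals = defaultdict(float)
    for segment in diarization.get("segments", []):
        duration = to_seconds(segment["stop"]) - to_seconds(segment["start"])
        totals[segment["speaker"]] += duration
    return dict(totals)


# Hand-written sample in the assumed shape, not real action output.
sample = {
    "segments": [
        {"speaker": "A", "start": "0:00:00.5", "stop": "0:00:04.5"},
        {"speaker": "B", "start": "0:00:04.5", "stop": "0:00:06.5"},
        {"speaker": "A", "start": 6.5, "stop": 8.5},
    ]
}
print(speaking_time_by_speaker(sample))
```

In real use you would pass the result of fetch_diarization(output_url) to speaking_time_by_speaker instead of the hand-written sample.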

Conceptual Usage Example (Python)

To illustrate how you might call this action, here's a conceptual Python code snippet that demonstrates structuring the input payload and making the API request:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "22790fb7-15ab-49aa-bd6e-66531e1310d5"  # Action ID for Segment Audio by Speaker

# Construct the input payload based on the action's requirements
payload = {
    "audioFileUri": "https://replicate.delivery/pbxt/JDG2I8TL3x7gAzn5rAswOcO8lpfQIUXnehBeUUiP59ZaL8oc/lex-levin-4min.mp3"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60  # Avoid hanging forever if the server does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; covers every JSON decode error requests can raise
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
  • The action_id corresponds to the Segment Audio by Speaker action.
  • The payload is structured according to the input schema requirements.
  • The response is printed in a structured format, allowing you to see the results of the action.

Conclusion

The lucataco/speaker-diarization Cognitive Actions give you a straightforward way to add speaker diarization to your applications. With the Segment Audio by Speaker action, you submit an audio file URI and get back a JSON file that attributes each segment of the recording to a speaker, which makes the audio easier to search, transcribe, and analyze. From here, consider exploring additional actions or integrating this one into a larger audio processing workflow. Happy coding!