Unlocking Speaker Insights: Integrate Diarization with eaa/diarisation Cognitive Actions

Understanding who is speaking, and when, is crucial for many audio applications, from transcription services to meeting analysis. The eaa/diarisation Cognitive Actions offer a powerful solution for performing speaker diarization: identifying and segmenting the individual speakers within an audio file, so developers can add sophisticated audio analysis to their applications. By using pre-built actions, developers save time while integrating complex functionality into their projects.
Prerequisites
Before you start integrating the eaa/diarisation Cognitive Actions, ensure you have the following:
- An API key for the Cognitive Actions platform.
- Basic understanding of JSON structure and Python programming.
- A valid audio file in WAV format accessible via a URI.
To authenticate your requests, you will typically send your API key in the request headers. This ensures that your application can securely communicate with the Cognitive Actions service.
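As a minimal sketch of this, assuming a standard bearer-token scheme (the exact header names your Cognitive Actions deployment expects may differ), the request headers could be built like so:

```python
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, not a real key

# Typical headers for an authenticated JSON API using bearer tokens
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
print(headers["Authorization"])
```

These same headers are reused in the full request example later in this article.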
Cognitive Actions Overview
Perform Speaker Diarization
The Perform Speaker Diarization action is designed to analyze audio files and differentiate between various speakers. It requires a URI pointing to the audio file and a JSON string that specifies the time segments for analysis.
Input
The input for this action consists of two required fields:
- audio: A URI pointing to the input audio file in WAV format.
- jsonRecords: A JSON string containing an array of records, each with start and duration attributes that define the segments of interest within the audio file.
Example Input:
{
"audio": "https://replicate.delivery/pbxt/K2w0v2lAkIdd96nYaESUJ7EAHQvQ51QhdAj8MJeNsxdmbM7p/sound_ac1_ar16K.wav",
"jsonRecords": "[{\"start\":0.84,\"duration\":0.56},{\"start\":1.92,\"duration\":0.52},{\"start\":3.92,\"duration\":0.48},{\"start\":4.76,\"duration\":0.56},{\"start\":6.44,\"duration\":1.52},{\"start\":9.4,\"duration\":1.56},{\"start\":11.88,\"duration\":0.56},{\"start\":12.48,\"duration\":2.32},{\"start\":18.64,\"duration\":0.84},{\"start\":21.2,\"duration\":0.4},{\"start\":22.36,\"duration\":2.96},{\"start\":25.36,\"duration\":0.68},{\"start\":26.28,\"duration\":3.88},{\"start\":30.6,\"duration\":5.08},{\"start\":41.34,\"duration\":3.6},{\"start\":47.38,\"duration\":0.84},{\"start\":49.9,\"duration\":3.32}]"
}
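Because jsonRecords is a JSON string rather than a nested object, it is easiest to build the segment list as ordinary Python data and serialize it with json.dumps. The segment values and audio URI below are illustrative:

```python
import json

# Illustrative voice-activity segments (start/duration in seconds)
segments = [
    {"start": 0.84, "duration": 0.56},
    {"start": 1.92, "duration": 0.52},
    {"start": 3.92, "duration": 0.48},
]

# Serialize the list to a string before placing it in the payload;
# passing the raw list here would not match the input schema
payload = {
    "audio": "https://example.com/sound.wav",  # placeholder URI
    "jsonRecords": json.dumps(segments),
}
print(payload["jsonRecords"])
```

Round-tripping with json.loads is a quick way to confirm the string is well-formed before sending it.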
Output
The output of this action is a nested array, where each sub-array represents segments that correspond to different speakers identified in the audio. For example, an output might look like:
[
[0],
[1, 2, 3, 4, 5, 13, 6, 7, 10, 11, 12, 14, 16],
[8],
[9],
[15]
]
This indicates which segments of the audio belong to each identified speaker.
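Since each number in the output is an index into the jsonRecords array you submitted, you can map the speaker groups back to time ranges. This sketch uses a small illustrative output and segment list, not real API results:

```python
# Illustrative diarization output: one list of segment indices per speaker
speaker_groups = [[0], [1, 2], [3]]

# The same segments that were sent in jsonRecords (illustrative values)
segments = [
    {"start": 0.84, "duration": 0.56},
    {"start": 1.92, "duration": 0.52},
    {"start": 3.92, "duration": 0.48},
    {"start": 4.76, "duration": 0.56},
]

# Resolve each segment index back to its (start, duration) pair
speaker_times = {}
for speaker_id, indices in enumerate(speaker_groups):
    speaker_times[speaker_id] = [
        (segments[i]["start"], segments[i]["duration"]) for i in indices
    ]
    print(f"Speaker {speaker_id}: {speaker_times[speaker_id]}")
```

The resulting dictionary gives you, per speaker, the exact time windows to cut or label in the original audio.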
Conceptual Usage Example (Python)
Below is a conceptual Python code snippet demonstrating how to call the Perform Speaker Diarization action.
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "422dc86d-7068-44fe-ac68-892ae33faa1a"  # Action ID for Perform Speaker Diarization

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/K2w0v2lAkIdd96nYaESUJ7EAHQvQ51QhdAj8MJeNsxdmbM7p/sound_ac1_ar16K.wav",
    "jsonRecords": "[{\"start\":0.84,\"duration\":0.56},{\"start\":1.92,\"duration\":0.52},{\"start\":3.92,\"duration\":0.48},{\"start\":4.76,\"duration\":0.56},{\"start\":6.44,\"duration\":1.52},{\"start\":9.4,\"duration\":1.56},{\"start\":11.88,\"duration\":0.56},{\"start\":12.48,\"duration\":2.32},{\"start\":18.64,\"duration\":0.84},{\"start\":21.2,\"duration\":0.4},{\"start\":22.36,\"duration\":2.96},{\"start\":25.36,\"duration\":0.68},{\"start\":26.28,\"duration\":3.88},{\"start\":30.6,\"duration\":5.08},{\"start\":41.34,\"duration\":3.6},{\"start\":47.38,\"duration\":0.84},{\"start\":49.9,\"duration\":3.32}]"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code:
- Replace COGNITIVE_ACTIONS_API_KEY and the endpoint with your actual API key and endpoint.
- The action_id is set to the ID of the Perform Speaker Diarization action.
- The payload is structured according to the input schema requirements.
Conclusion
The eaa/diarisation Cognitive Actions provide developers with a streamlined way to implement speaker diarization in their applications. By leveraging these actions, you can enhance your audio processing capabilities, making it easier to analyze and understand conversations. Whether you're building a transcription service or enhancing user interactions in applications, these tools offer flexibility and efficiency.
As you explore further, consider how integrating additional audio processing features could elevate your application even more!