Efficient Speaker Transcription with Diarization for Developers

26 Apr 2025

Transcribing audio files has become an essential task in various fields, from content creation and legal documentation to academic research. The "Speaker Transcription" service provides developers with the tools to automate this process, making it faster and more accurate. By leveraging advanced AI models, this service offers not just transcription but also speaker diarization, allowing you to identify and label different speakers in the audio. This capability simplifies the analysis of interviews, meetings, and discussions, enabling you to extract meaningful insights quickly and efficiently.

Prerequisites

To use the Speaker Transcription service, you'll need an API key for Cognitive Actions and a basic understanding of how to make API calls.
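Rather than hard-coding the key, it's good practice to read it from the environment. A minimal sketch; the environment-variable name here is a convention of this example, not something mandated by the service:

```python
import os

def load_api_key(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> str:
    """Read the Cognitive Actions API key from the environment."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before calling the service.")
    return key
```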

Perform Whisper Transcription with Speaker Diarization

This action is designed to transcribe English audio files while simultaneously identifying and segmenting different speakers. By utilizing OpenAI's Whisper model for transcription and a pre-trained diarization pipeline from pyannote.audio, this action outputs detailed information, including speaker labels, timestamps, and speaker embeddings.

Purpose

The primary goal of this action is to convert spoken language in audio files into written text, while also distinguishing between different speakers. This dual functionality helps users better understand the context and contributions of each speaker in a conversation.

Input Requirements

The input for this action is an audio file accessible via a URI, in a common audio format such as MP3 or WAV. An optional prompt can also be provided to influence the transcription output.

  • Audio: A URI pointing to the audio file (e.g., https://replicate.delivery/pbxt/IZruuPAVCQh1lI25MIihRwFHN4MvjH7xcBTgnbXUDM1CAY7m/lex-levin-4min.mp3)
  • Prompt: (Optional) A text prompt for influencing the model's response.
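Assembling the input payload from these two fields can be sketched as follows; the `audio` and `prompt` field names mirror the list above:

```python
def build_inputs(audio_uri: str, prompt: str = "") -> dict:
    """Build the action's input payload; the prompt is optional."""
    inputs = {"audio": audio_uri}
    if prompt:
        inputs["prompt"] = prompt
    return inputs
```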

Expected Output

The output will be a JSON object containing the transcription results, which includes:

  • Detected speakers
  • Timestamps for when each speaker speaks
  • Speaker embeddings
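If the returned JSON exposes per-segment speaker labels and timestamps, you can group time spans by speaker for downstream analysis. The `segments` key and per-segment field names below are assumptions about the output shape, not a documented schema:

```python
def spans_by_speaker(result: dict) -> dict:
    """Group (start, end) time spans by speaker label.

    Assumes the action returns a `segments` list whose entries
    carry `speaker`, `start`, and `end` fields.
    """
    spans: dict = {}
    for seg in result.get("segments", []):
        spans.setdefault(seg["speaker"], []).append((seg["start"], seg["end"]))
    return spans
```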

Use Cases for this Specific Action

  1. Interviews and Focus Groups: Researchers can use this action to transcribe interviews, making it easier to analyze discussions and extract quotes attributed to specific participants.
  2. Meeting Minutes: Businesses can automate the process of creating meeting minutes by transcribing discussions and identifying who said what, streamlining documentation.
  3. Podcasts and Webinars: Content creators can enhance their accessibility by providing transcriptions of spoken content, thereby reaching a wider audience.
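For the meeting-minutes use case, diarized segments can be rendered as attributed transcript lines. As above, the per-segment field names are assumed for illustration:

```python
def format_minutes(segments: list) -> str:
    """Render diarized segments as attributed transcript lines.

    Each segment is assumed to carry `speaker`, `start`, and `text` fields.
    """
    return "\n".join(
        f"[{seg['start']:.1f}s] {seg['speaker']}: {seg['text'].strip()}"
        for seg in segments
    )
```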

```python
import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "b718176e-b980-40db-8de5-8976026a58f8" # Action ID for: Perform Whisper Transcription with Speaker Diarization

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "audio": "https://replicate.delivery/pbxt/IZruuPAVCQh1lI25MIihRwFHN4MvjH7xcBTgnbXUDM1CAY7m/lex-levin-4min.mp3"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")
```

Conclusion

The Speaker Transcription service with diarization capabilities offers a powerful solution for developers looking to automate audio transcription. By providing detailed speaker identification and timestamps, it enhances the usability of transcriptions in various applications. Whether you're working on academic research, business documentation, or content creation, this service can save you time and improve the quality of your outputs. 

As a next step, consider integrating this action into your applications to streamline your transcription needs and enhance user engagement with accurate and accessible content.