Enhance Your Applications with Speech Recognition and Diarization

25 Apr 2025

In today's fast-paced digital landscape, integrating advanced speech recognition capabilities into your applications can significantly enhance user experiences and streamline processes. WhisperX offers powerful Cognitive Actions designed to perform automatic speech recognition (ASR) with added features like timestamps and speaker diarization. This functionality not only transcribes audio files into text but also distinguishes between different speakers, making it an invaluable tool for a wide range of applications.

Imagine automating transcription services for podcasts, meetings, or interviews, where clarity and speaker identification are crucial. With WhisperX, developers can leverage multi-language support, making it ideal for global applications. The ability to provide detailed outputs, including word-level timestamps and speaker differentiation, enables comprehensive speech analysis for customer service, media production, and education.

Prerequisites

To get started with WhisperX, you will need a Cognitive Actions API key and a basic understanding of how to make API calls.
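Keep the API key out of your source code. As a minimal sketch (the environment variable name COGNITIVE_ACTIONS_API_KEY is an assumption; use whatever name fits your deployment), you might load it like this:

```python
import os

# Hypothetical environment variable name; adjust to your deployment.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")
if not api_key:
    # Warn early rather than failing on the first API call.
    print("Warning: COGNITIVE_ACTIONS_API_KEY is not set; API calls will fail.")
```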

Perform Speech Recognition with Timestamps and Diarization

This operation performs automatic speech recognition on audio files, providing word-level timestamps and speaker diarization. By identifying multiple speakers within a conversation, it enables detailed speech analysis across a range of languages.

Input Requirements

The input for this action requires an accessible URI pointing to the audio file. Additionally, you can specify options for debugging, speaker identification (diarization), the language of the audio, batch size, and the number of speakers to detect.

  • Audio: A URI pointing to the audio file (e.g., https://replicate.delivery/pbxt/K3BGhaLBJ3nhPfDteXbTA8xIuQvC5dR3wViyiX0OKuzrVJ6f/erium.wav).
  • Debug: (optional) Enables verbose output, default is false.
  • Diarize: (optional) Enables speaker identification, default is false.
  • Language: (optional) Specifies the language of the audio, default is German (de).
  • Batch Size: (optional) Specifies how many audio chunks are processed in parallel, default is 32.
  • Max Speakers: (optional) Sets the upper limit on detected speakers.
  • Min Speakers: (optional) Sets the lower limit on detected speakers.
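Putting the fields above together, a full input payload might look like the following sketch. The key names for the speaker limits ("minSpeakers"/"maxSpeakers") are assumptions inferred from the field list, so check the action's schema; the remaining keys mirror the request example later in this post.

```python
# Example input payload covering the documented fields.
# "minSpeakers"/"maxSpeakers" are assumed key names -- verify against the schema.
payload = {
    "audio": "https://replicate.delivery/pbxt/K3BGhaLBJ3nhPfDteXbTA8xIuQvC5dR3wViyiX0OKuzrVJ6f/erium.wav",
    "debug": False,       # verbose output off
    "diarize": True,      # enable speaker identification
    "language": "de",     # language of the audio
    "batchSize": 32,      # chunks processed in parallel
    "minSpeakers": 1,     # lower bound on detected speakers (assumed key name)
    "maxSpeakers": 2,     # upper bound on detected speakers (assumed key name)
}
print(sorted(payload))
```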

Expected Output

The output of this action includes a structured list of recognized speech segments, each with associated timestamps, speaker identification, and the transcribed text.

Example output:

[
  {
    "end": 10.742,
    "text": " Ihr hört die IRIUM Podcast, der Data Science und Machine Learning Podcast für Young Professionals und Studienabsolventen, die wirklich wissen wollen, was in der Arbeitswelt abgeht.",
    "start": 0.009,
    "speaker": "SPEAKER_00"
  }
]
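Once you have this list of segments, a small helper can render it as a readable, speaker-labeled transcript. This sketch assumes the segment field names shown in the example output above; the sample text is a placeholder, not real output:

```python
def format_transcript(segments):
    """Render diarized segments as 'SPEAKER [start-end]: text' lines."""
    lines = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(
            f"{speaker} [{seg['start']:.2f}-{seg['end']:.2f}]: {seg['text'].strip()}"
        )
    return "\n".join(lines)

# Placeholder segment in the shape returned by the action.
segments = [
    {"start": 0.009, "end": 10.742, "speaker": "SPEAKER_00",
     "text": " Welcome to the podcast."},
]
print(format_transcript(segments))
# -> SPEAKER_00 [0.01-10.74]: Welcome to the podcast.
```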

Use Cases for this Specific Action

  • Transcription Services: Automatically transcribe interviews, meetings, or webinars, providing clients with accurate records.
  • Podcast Production: Streamline content creation by generating transcripts for podcasts, enhancing accessibility for hearing-impaired audiences.
  • Customer Service: Analyze customer interactions in call centers to improve service quality and training programs.
  • Educational Tools: Create study materials from recorded lectures, enabling students to review content efficiently.

Example Request

The following Python script sends the example payload to a hypothetical Cognitive Actions execution endpoint:

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "83425dde-7089-4b3d-995d-eabd23feeaa4" # Action ID for: Perform Speech Recognition with Timestamps and Diarization

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "audio": "https://replicate.delivery/pbxt/K3BGhaLBJ3nhPfDteXbTA8xIuQvC5dR3wViyiX0OKuzrVJ6f/erium.wav",
  "debug": False,
  "diarize": True,
  "language": "de",
  "batchSize": 32
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

WhisperX's speech recognition with timestamps and diarization opens up a world of possibilities for developers looking to enhance their applications with audio analysis capabilities. By leveraging this technology, you can improve user engagement, accessibility, and content management across various domains. As you integrate these Cognitive Actions, consider the diverse applications and efficiencies they can bring to your projects, and explore how WhisperX can turn your audio processing tasks into seamless, automated workflows.