Efficiently Transcribe Large Audio Files with WhisperX Cognitive Actions

22 Apr 2025

In the world of audio processing, extracting meaningful content from audio files can be a challenging task. The WhisperX Cognitive Actions provide developers with powerful tools to efficiently transcribe large audio files. Leveraging advanced capabilities like word-level timestamps and speaker diarization, these actions optimize the transcription process while ensuring high accuracy and speed.

Prerequisites

Before diving into the integration of WhisperX Cognitive Actions, ensure that you have the following prerequisites:

  • An API key for the Cognitive Actions platform, which you will use for authentication.
  • Basic familiarity with making API requests, particularly the ability to send JSON payloads.

Conceptually, you'll need to pass your API key in the request headers when invoking the actions.
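As a minimal sketch, the headers might look like the following. The bearer-token scheme shown here is an assumption based on common API conventions; substitute whatever authentication scheme your Cognitive Actions account specifies.

```python
# Hypothetical authentication headers for Cognitive Actions requests.
# Replace the placeholder with your real API key.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",  # bearer-token scheme (assumed)
    "Content-Type": "application/json",
}
```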

Cognitive Actions Overview

Transcribe Large Audio Files

The Transcribe Large Audio Files action enables you to efficiently transcribe audio files using the WhisperX large-v3 model. It supports features such as word-level timestamps and speaker diarization to enhance the accuracy of your transcriptions.

  • Category: Audio Transcription

Input

The input for this action requires an object with the following schema:

{
  "audioFile": "https://example.com/audio.wav",
  "debug": false,
  "language": "None",
  "batchSize": 64,
  "alignOutput": false,
  "diarization": false,
  "temperature": 0,
  "initialPrompt": "",
  "maximumSpeakers": null,
  "minimumSpeakers": null,
  "huggingFaceAccessToken": "",
  "voiceActivityDetectionOnset": 0.5,
  "voiceActivityDetectionOffset": 0.363,
  "languageDetectionMaximumTries": 5,
  "languageDetectionMinimumProbability": 0
}
  • Required Field:
    • audioFile: The URI of the audio file to be processed.
  • Optional Fields:
    • debug: Enables debug information.
    • language: Specify the spoken language, or use 'None' for auto-detection.
    • batchSize: Number of audio inputs to process in parallel (default: 64).
    • alignOutput: If true, aligns transcription output to produce word-level timestamps.
    • diarization: Enables speaker diarization.
    • temperature: Controls output randomness (default: 0).
    • initialPrompt: Optional text used to prime the first transcription window.
    • minimumSpeakers / maximumSpeakers: Bounds on the number of speakers when diarization is enabled.
    • huggingFaceAccessToken: Hugging Face access token required by the diarization models.
    • voiceActivityDetectionOnset: VAD onset threshold (default: 0.5).
    • voiceActivityDetectionOffset: VAD offset threshold (default: 0.363).
    • languageDetectionMaximumTries: Max attempts for language detection (default: 5).
    • languageDetectionMinimumProbability: Minimum probability for language detection (default: 0).
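Since only audioFile is required, a small helper can build a payload that falls back to the documented defaults for everything else. The function below is a hypothetical convenience, not part of any official SDK; the field names and defaults come directly from the schema above.

```python
# Hypothetical helper: build a transcription payload from the documented
# schema. Only audioFile is required; any keyword argument overrides the
# corresponding default.
def build_transcription_payload(audio_file, **overrides):
    if not audio_file:
        raise ValueError("audioFile is required")
    payload = {
        "audioFile": audio_file,
        "debug": False,
        "language": "None",  # the string 'None' triggers auto-detection
        "batchSize": 64,
        "alignOutput": False,
        "diarization": False,
        "temperature": 0,
        "voiceActivityDetectionOnset": 0.5,
        "voiceActivityDetectionOffset": 0.363,
        "languageDetectionMaximumTries": 5,
        "languageDetectionMinimumProbability": 0,
    }
    payload.update(overrides)
    return payload


# Example: enable diarization while keeping the other defaults.
payload = build_transcription_payload(
    "https://example.com/audio.wav",
    diarization=True,
)
```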

Example Input

Here’s a practical example of the JSON payload needed to invoke this action:

{
  "debug": false,
  "audioFile": "https://replicate.delivery/pbxt/JrckTmbaACSq83MQ5IW8E85b2NPUWZYpCyvxD7A836I5j21G/OSR_uk_000_0050_8k.wav",
  "batchSize": 64,
  "alignOutput": false,
  "diarization": false,
  "temperature": 0,
  "voiceActivityDetectionOnset": 0.5,
  "voiceActivityDetectionOffset": 0.363
}

Output

Upon successful execution, the action returns an output containing the transcription segments and the detected language. Here’s an example output structure:

{
  "segments": [
    {
      "end": 30.811,
      "text": "The little tales they tell are false...",
      "start": 2.585
    },
    {
      "end": 48.592,
      "text": "The room was crowded with a wild mob...",
      "start": 33.029
    }
  ],
  "detected_language": "en"
}
  • Output Fields:
    • segments: An array of transcription segments with start, end, and text.
    • detected_language: The language detected from the audio file.
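Because each segment carries start and end times in seconds, the output maps naturally onto subtitle formats. The sketch below converts the segments array into SRT-style entries; it assumes only the segment structure shown above (start, end, text).

```python
# Sketch: convert the action's segments into SRT-style subtitle entries.
def seconds_to_srt(t):
    """Format seconds as an SRT timestamp, e.g. 00:00:02,585."""
    hours, rem = divmod(int(t), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int(round((t - int(t)) * 1000))
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render a list of {start, end, text} segments as an SRT document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{seconds_to_srt(seg['start'])} --> {seconds_to_srt(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates entries
    return "\n".join(lines)
```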

Conceptual Usage Example (Python)

Here’s how you might call the Transcribe Large Audio Files action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "04a1ce9f-257f-457d-a24d-8df4319e6453"  # Action ID for Transcribe Large Audio Files

# Construct the input payload based on the action's requirements
payload = {
    "debug": False,
    "audioFile": "https://replicate.delivery/pbxt/JrckTmbaACSq83MQ5IW8E85b2NPUWZYpCyvxD7A836I5j21G/OSR_uk_000_0050_8k.wav",
    "batchSize": 64,
    "alignOutput": False,
    "diarization": False,
    "temperature": 0,
    "voiceActivityDetectionOnset": 0.5,
    "voiceActivityDetectionOffset": 0.363
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key.
  • The action_id is set to the ID for the transcription action.
  • The payload is constructed to meet the input requirements of the action.

Conclusion

The WhisperX Cognitive Actions provide a powerful means to transcribe large audio files with ease and precision. By utilizing features like speaker diarization and word-level timestamps, developers can enhance their applications' audio processing capabilities. Consider experimenting with these actions in your projects to streamline audio transcription workflows and unlock valuable insights from audio data.