Generate Synchronized Subtitles Easily with WhisperX Cognitive Actions

24 Apr 2025

Creating subtitles for audio content can be a daunting task, especially when striving for accuracy and synchronization. The dashed/whisperx-subtitles-replicate API provides a powerful solution through its Cognitive Action: Generate Subtitles from Audio. This action leverages the WhisperX model to automatically produce synchronized SRT subtitles from audio files, making it an essential tool for developers looking to enhance their applications with subtitle generation capabilities.

Prerequisites

Before diving into the integration of Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic familiarity with JSON and API calls.
  • A suitable development environment set up for making HTTP requests (e.g., Python with the requests library).

For authentication, you will typically pass your API key in the headers of your requests.
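As a small illustration, a helper that builds those headers might look like the following. Note that the Bearer scheme shown here is an assumption based on common API practice; confirm the exact authentication scheme in your platform's documentation.

```python
def auth_headers(api_key: str) -> dict:
    """Build request headers for the Cognitive Actions API.

    Assumes Bearer-token authentication, which is typical but not
    confirmed for this platform.
    """
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

You can then pass `auth_headers(your_key)` as the `headers` argument to any HTTP request you make against the API.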

Cognitive Actions Overview

Generate Subtitles from Audio

This action creates synchronized SRT subtitles from audio files. It utilizes the WhisperX model to transcribe audio data, generating word-level timestamps and properly formatted subtitles.

Category: Automatic Subtitle Generation

Input

The input for this action requires a JSON object structured according to the following schema:

{
  "audioUri": "string (required)",
  "debug": "boolean (optional, default: false)",
  "language": "string (optional, e.g., 'en')",
  "vadOnset": "number (optional, default: 0.5)",
  "batchSize": "integer (optional, default: 64)",
  "vadOffset": "number (optional, default: 0.363)",
  "alignOutput": "boolean (optional, default: true)",
  "diarization": "boolean (optional, default: false)",
  "temperature": "number (optional, default: 0)",
  "maximumSpeakers": "integer (optional, for diarization)",
  "minimumSpeakers": "integer (optional, for diarization)",
  "initialPromptText": "string (optional)",
  "huggingfaceAccessToken": "string (optional, for diarization)",
  "languageDetectionMaxAttempts": "integer (optional, default: 5)",
  "languageDetectionMinProbability": "number (optional, default: 0)"
}
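Since `audioUri` is the only required field and every other parameter has a documented default, a minimal request can omit the rest. A small illustrative helper (not part of the API itself) makes that explicit:

```python
def build_payload(audio_uri: str, **overrides) -> dict:
    """Build an input payload for Generate Subtitles from Audio.

    Only audioUri is required by the schema; any keyword override
    (e.g. language="en", batchSize=32) is merged on top and otherwise
    the server-side defaults apply.
    """
    payload = {"audioUri": audio_uri}
    payload.update(overrides)
    return payload
```

For example, `build_payload("https://example.com/talk.m4a", language="en")` produces a valid two-field payload.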

Example Input:

{
  "debug": false,
  "audioUri": "https://replicate.delivery/pbxt/Lj3F130lpxtKYucKpix3vQ9NDnoUGcoZByurQhgjGmIZ4As1/The_Greatest_Speech_Ever_Made_-_Original.m4a",
  "language": "en",
  "vadOnset": 0.5,
  "batchSize": 64,
  "vadOffset": 0.363,
  "alignOutput": true,
  "diarization": false,
  "temperature": 0,
  "languageDetectionMaxAttempts": 5,
  "languageDetectionMinProbability": 0
}
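The schema also marks `huggingfaceAccessToken`, `minimumSpeakers`, and `maximumSpeakers` as diarization-related, so a payload with speaker diarization enabled would presumably look like the sketch below. The URL and token are placeholders, and the exact token requirements should be confirmed against the model's documentation.

```python
# Hypothetical diarization-enabled input, based on the schema above.
diarization_input = {
    "audioUri": "https://example.com/interview.m4a",  # placeholder URL
    "diarization": True,
    "huggingfaceAccessToken": "hf_XXXX",  # placeholder; required for diarization per the schema
    "minimumSpeakers": 2,
    "maximumSpeakers": 4,
}
```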

Output

The output of this action is a JSON object containing the segments of subtitles generated from the audio. Here is an example of what the output might look like:

{
  "segments": [
    {
      "start": 6.636,
      "end": 9.677,
      "text": "I'm sorry, but I don't want to be an emperor.",
      "words": [
        {"word": "I'm", "start": 6.636, "end": 6.776, "score": 0.884},
        {"word": "sorry,", "start": 6.836, "end": 7.116, "score": 0.909}
        // Additional words...
      ]
    }
    // Additional segments...
  ],
  "srt_file": "URL to the generated SRT file",
  "detected_language": "en"
}
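If you prefer to build subtitle text yourself from the `segments` array rather than downloading the returned `srt_file`, recall that SRT timestamps use the `HH:MM:SS,mmm` format. A minimal conversion sketch:

```python
def to_srt_timestamp(seconds: float) -> str:
    """Convert seconds (e.g. 6.636) to an SRT timestamp like 00:00:06,636."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Render a list of {start, end, text} segments as an SRT document."""
    lines = []
    for index, seg in enumerate(segments, start=1):
        lines.append(str(index))
        lines.append(f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates SRT entries
    return "\n".join(lines)
```

Feeding the example segment above through `segments_to_srt` yields an entry spanning `00:00:06,636 --> 00:00:09,677`.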

Conceptual Usage Example (Python)

Here's a conceptual example of how a developer might call this action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "de40e552-449c-400d-92f7-d2cbf1984976"  # Action ID for Generate Subtitles from Audio

# Construct the input payload based on the action's requirements
payload = {
    "debug": False,
    "audioUri": "https://replicate.delivery/pbxt/Lj3F130lpxtKYucKpix3vQ9NDnoUGcoZByurQhgjGmIZ4As1/The_Greatest_Speech_Ever_Made_-_Original.m4a",
    "language": "en",
    "vadOnset": 0.5,
    "batchSize": 64,
    "vadOffset": 0.363,
    "alignOutput": True,
    "diarization": False,
    "temperature": 0,
    "languageDetectionMaxAttempts": 5,
    "languageDetectionMinProbability": 0
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # response body was not valid JSON
            print(f"Response body: {e.response.text}")

In this code snippet, the developer constructs the input payload according to the schema and sends it to the hypothetical endpoint. The parsed response is then printed, including the generated subtitle segments.
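Assuming the result contains an `srt_file` URL as in the example output above, you might then save the generated subtitles to disk. The sketch below uses the standard library's `urllib` and accepts an injectable opener so it can be exercised without network access; the field name `srt_file` is taken from the example output, not a guaranteed contract.

```python
from urllib.request import urlopen

def save_srt(result: dict, path: str, opener=urlopen) -> str:
    """Download the SRT file referenced by result["srt_file"] and write it to path."""
    url = result["srt_file"]
    with opener(url) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

A typical call after a successful request would be `save_srt(result, "subtitles.srt")`.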

Conclusion

The Generate Subtitles from Audio action under the dashed/whisperx-subtitles-replicate API offers developers a seamless way to automate the generation of synchronized subtitles from audio files. By integrating this powerful Cognitive Action into your applications, you can enhance accessibility and user engagement significantly. Consider exploring more use cases or combining this action with other features to create a robust multimedia experience!