Accelerate Audio Transcription with WhisperX Cognitive Actions

Demand for efficient audio transcription is rising, and the WhisperX Cognitive Actions give developers a powerful API for fast, accurate transcription, complete with word-level timestamps and speaker diarization. This article walks through the capabilities of the WhisperX action and shows how to integrate audio transcription into your applications.
Prerequisites
Before you start using the WhisperX Cognitive Actions, ensure you have:
- An API key for the Cognitive Actions platform.
- Familiarity with making HTTP requests and handling JSON data.
- An accessible audio file in a valid URI format.
Authentication typically involves passing your API key in the headers of your requests.
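As a minimal sketch, request headers might be assembled like this; the bearer-token scheme shown is an assumption, so confirm the exact header format against the platform's authentication documentation:

```python
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder; supply your real key

# Assumed bearer-token scheme; check the platform's auth docs
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```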
Cognitive Actions Overview
Transcribe Audio with WhisperX
Description: This action accelerates audio transcription using the WhisperX large-v3 model. It provides accurate transcriptions of audio files with features such as word-level timestamps and speaker diarization. The action supports audio files that are a few hours long and under a couple of hundred MB in size.
Category: Audio Transcription
Input
The input for this action requires a single field, audioFile, and accepts several optional settings:
- audioFile (required): URL of the audio file to be transcribed.
- debug (optional): Indicates if debug information should be printed. Default is false.
- language (optional): ISO 639-1 code for the spoken language; use 'None' for automatic detection.
- vadOnset (optional): Voice Activity Detection onset threshold. Default is 0.5.
- batchSize (optional): Number of audio segments processed in parallel. Default is 64.
- vadOffset (optional): Voice Activity Detection offset threshold. Default is 0.363.
- alignOutput (optional): Align model output for accurate timestamps. Default is false.
- diarization (optional): Assign speaker labels. Default is false.
- maxSpeakers (optional): Maximum distinct speakers for diarization.
- minSpeakers (optional): Minimum distinct speakers for diarization.
- temperature (optional): Controls randomness of sampling. Default is 0.
- initialPrompt (optional): Initial text prompt for the first transcription window.
- huggingfaceAccessToken (optional): Access token for enabling diarization.
- languageDetectionMinProb (optional): Minimum detection probability for language detection.
- languageDetectionMaxTries (optional): Maximum attempts for language detection.
Example Input:
{
  "debug": false,
  "vadOnset": 0.5,
  "audioFile": "https://replicate.delivery/pbxt/JrvsggK5WvFQ4Q53h4ugPbXW0LK2BLnMZm2dCPhM8bodUq5w/OSR_uk_000_0050_8k.wav",
  "batchSize": 64,
  "vadOffset": 0.363,
  "alignOutput": false,
  "diarization": false,
  "temperature": 0
}
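If you build inputs from Python, one way to keep the defaults listed above in a single place is a small helper that merges them with per-request overrides. The helper below is a sketch, not part of the API; the default values mirror the parameter list above:

```python
# Defaults taken from the parameter list above
WHISPERX_DEFAULTS = {
    "debug": False,
    "language": None,       # None -> automatic language detection
    "vadOnset": 0.5,
    "batchSize": 64,
    "vadOffset": 0.363,
    "alignOutput": False,
    "diarization": False,
    "temperature": 0,
}

def build_input(audio_file: str, **overrides) -> dict:
    """Build a WhisperX input payload: defaults, then overrides, then the required audioFile."""
    if not audio_file:
        raise ValueError("audioFile is required")
    return {**WHISPERX_DEFAULTS, **overrides, "audioFile": audio_file}
```

For example, `build_input("https://example.com/meeting.wav", diarization=True)` keeps every default except diarization (remember that diarization also needs huggingfaceAccessToken).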
Output
The output of this action typically includes segments of transcribed text with their corresponding start and end times, along with the detected language. Here is what the output structure looks like:
Example Output:
{
  "segments": [
    {
      "end": 30.811,
      "text": " The little tales they tell are false. The door was barred, locked and bolted as well...",
      "start": 2.585
    },
    {
      "end": 48.592,
      "text": " The room was crowded with a wild mob. This strong arm shall shield your honour...",
      "start": 33.029
    }
  ],
  "detected_language": "en"
}
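The segments array is straightforward to post-process. As a sketch (field names taken from the example output above), the following renders each segment as a timestamped line:

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS.mmm."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def render_transcript(output: dict) -> str:
    """Turn the action's output into human-readable, timestamped lines."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}]{seg['text']}"
        for seg in output["segments"]
    )
```

The same loop is a natural starting point for other formats such as SRT or VTT subtitles.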
Conceptual Usage Example (Python)
Here’s a conceptual Python code snippet demonstrating how to call the WhisperX action using a hypothetical endpoint:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "8f3540be-1a6d-4e34-b58a-55d90ab329d6"  # Action ID for Transcribe Audio with WhisperX

# Construct the input payload based on the action's requirements
payload = {
    "debug": False,
    "vadOnset": 0.5,
    "audioFile": "https://replicate.delivery/pbxt/JrvsggK5WvFQ4Q53h4ugPbXW0LK2BLnMZm2dCPhM8bodUq5w/OSR_uk_000_0050_8k.wav",
    "batchSize": 64,
    "vadOffset": 0.363,
    "alignOutput": False,
    "diarization": False,
    "temperature": 0
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # body was not valid JSON
            print(f"Response body: {e.response.text}")
In this snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action_id corresponds to the "Transcribe Audio with WhisperX" action, and the payload contains the required input parameters. The endpoint URL and structure are illustrative; they will depend on the actual API documentation.
Conclusion
Integrating the WhisperX Cognitive Actions into your applications allows for efficient audio transcription with enhanced features like speaker diarization and word-level timestamps. By leveraging this powerful tool, developers can streamline workflows and enhance user experiences. As a next step, consider experimenting with various audio files and configurations to see how the transcription results can be tailored to your specific needs.