Accelerate Your Audio Transcription with WhisperX Cognitive Actions

In today's digital world, audio transcription is a crucial tool for various applications, from content creation to accessibility. The WhisperX Cognitive Actions offer a powerful solution for developers looking to streamline audio transcription processes. With features like word-level timestamps and speaker diarization, these actions enhance the transcription experience across multiple languages, providing an efficient way to convert audio into text.
Prerequisites
Before diving into the integration of WhisperX Cognitive Actions, ensure you have the following:
- An API key for the Cognitive Actions platform.
- A basic understanding of RESTful APIs and JSON.
- Access to audio files that you wish to transcribe.
Authentication typically involves passing your API key in the headers of your requests, ensuring that you have the necessary permissions to execute the actions.
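As a minimal sketch, the authentication headers might look like the following. The header name and the `Bearer` scheme are assumptions; confirm the exact format in the platform's documentation:

```python
# Hypothetical example: building the request headers for the Cognitive
# Actions platform. The "Bearer" scheme is an assumption.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```

These headers are then passed with every request, as shown in the full Python example later in this article.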
Cognitive Actions Overview
Perform Accelerated Audio Transcription with Word-Level Timestamps
This action utilizes WhisperX to perform fast audio transcription while providing enhanced word-level timestamps and speaker diarization. It supports multiple languages for transcription, with alignment specifically for English.
Input
The input for this action requires a JSON object that includes the following fields:
- audio (required): A URI pointing to the audio file to be processed.
- debug (optional): A boolean value to enable memory usage logging for debugging (default is false).
- onlyText (optional): A boolean to specify if only transcribed text should be returned (default is false).
- batchSize (optional): An integer that determines the number of parallel transcriptions (default is 32).
- alignOutput (optional): A boolean to enable word-level timing information in the output (default is false).
Example Input:
{
  "audio": "https://replicate.delivery/pbxt/J5r78wKSymorzW9idAbbbJ7iXQl9GddZTwfdX5OlLJW2hLR2/OSR_uk_000_0050_8k.wav",
  "batchSize": 32
}
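For reference, a payload sketch that sets every documented field looks like this. The values shown are the documented defaults, only audio is required, and the URL is a placeholder:

```python
import json

# A sketch of a full input payload. Only "audio" is required; the other
# values shown here are the documented defaults.
payload = {
    "audio": "https://example.com/audio.wav",  # placeholder URI
    "debug": False,        # enable memory usage logging for debugging
    "onlyText": False,     # return only the transcribed text
    "batchSize": 32,       # number of parallel transcriptions
    "alignOutput": False,  # include word-level timing information
}

print(json.dumps(payload, indent=2))
```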
Output
The action returns an array of objects, each containing:
- start: The start time of the transcribed segment.
- end: The end time of the transcribed segment.
- text: The transcribed text of the audio segment.
Example Output:
[
  {
    "end": 30.772,
    "text": " De små hænder, de fortæller, er falske. Døren var barret, lukket og boltet også. Række pærer er tilfældige for en kvindestabel. En stor, varm sten stod på det runde skjæl. Kajten døbte og svævede, men blevede aloft. De nødvendige tider flyver meget for snart. Rommet var overrasket med en mild vand.",
    "start": 2.557
  },
  {
    "end": 48.558,
    "text": " Rommet var fulgt med en vild mob. Denne stærke arme vil skylde din ægte. Hun bluskede, da han gav hende en hvid orkid. Betlen dronede i den varme julesøvn.",
    "start": 32.999
  }
]
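Once you have this array, post-processing is straightforward. The sketch below, using the first few words of each segment from the example output above, computes per-segment durations and joins the segments into a single transcript:

```python
# Post-process the action's output: compute each segment's duration
# and concatenate the segment texts into one transcript.
segments = [
    {"start": 2.557, "end": 30.772, "text": " De små hænder, de fortæller, er falske."},
    {"start": 32.999, "end": 48.558, "text": " Rommet var fulgt med en vild mob."},
]

durations = [round(seg["end"] - seg["start"], 3) for seg in segments]
transcript = " ".join(seg["text"].strip() for seg in segments)

print(durations)   # per-segment durations in seconds
print(transcript)  # full transcript as a single string
```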
Conceptual Usage Example (Python)
Below is a conceptual example of how you might call the WhisperX action using Python:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

# Action ID for Perform Accelerated Audio Transcription with Word-Level Timestamps
action_id = "4ffa28ca-c4ce-4ee4-9eda-48ee4ddd9d77"

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/J5r78wKSymorzW9idAbbbJ7iXQl9GddZTwfdX5OlLJW2hLR2/OSR_uk_000_0050_8k.wav",
    "batchSize": 32
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers json.JSONDecodeError
            print(f"Response body: {e.response.text}")
In this Python code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action_id corresponds to the transcription action, and the input payload is structured according to the action's requirements. The endpoint URL and request structure are illustrative and should align with the actual API documentation.
Conclusion
By leveraging the WhisperX Cognitive Actions, developers can efficiently transcribe audio files with advanced features like word-level timestamps and speaker diarization. This capability not only enhances accessibility but also improves content usability across various applications. As you explore these actions, consider how they can be integrated into your projects to streamline your audio processing workflows. Happy coding!