Enhance Audio Transcription Accuracy with Cog Whisperx

In today's digital landscape, audio content is everywhere, from podcasts and webinars to interviews and lectures. Transcribing this audio efficiently and accurately is crucial for accessibility, content creation, and data analysis. The "Cog Whisperx Withprompt" service gives developers a machine-learning-powered solution for transcribing audio files. It accepts an optional initial text prompt that improves contextual accuracy, making it particularly useful for applications that require a precise understanding of audio content.
With features like debugging capabilities and word-level timing, developers can easily integrate this service into their applications, streamlining the transcription process while ensuring high-quality results. Common use cases include creating transcripts for media content, generating subtitles for videos, and converting spoken language into text for documentation purposes.
Prerequisites
To get started with the Cog Whisperx Withprompt service, you will need a valid Cognitive Actions API key and a basic understanding of making API calls.
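Rather than hard-coding the API key, it is safer to read it from the environment. A minimal sketch (the variable name `COGNITIVE_ACTIONS_API_KEY` is an assumption for illustration):

```python
import os

# Read the API key from an environment variable instead of embedding it in code.
# "COGNITIVE_ACTIONS_API_KEY" is an assumed variable name for this example.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")
if not api_key:
    print("Warning: COGNITIVE_ACTIONS_API_KEY is not set")
```
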
Execute WhisperX Transcription with Initial Prompt
The "Execute WhisperX Transcription with Initial Prompt" action allows you to transcribe audio files with the added benefit of an initial text prompt. This prompt can significantly improve the contextual understanding of the transcription, especially in scenarios where specific terminology or context is critical.
Input Requirements
The input for this action requires a structured JSON object with the following properties:
- audio: A URI link to the audio file in a supported format (e.g., MP3).
- debug: (Optional) A boolean value to enable debugging output.
- batchSize: (Optional) An integer setting the batch size for parallel transcription (default is 32).
- alignOutput: (Optional) A boolean indicating if word-level timing should be included in the output.
- initialPrompt: (Optional) A string providing context for the first segment of audio, enhancing the transcription’s accuracy.
Example Input:
{
  "audio": "https://replicate.delivery/pbxt/JP7lsBh30UlqIxXbmhqUVUH8BqdRBAvGm5TFaREzXPiRSCqE/Jaws%20%281975%29%20-%20The%20Indianapolis%20Speech%20Scene%20%287_10%29%20_%20Movieclips.mp3",
  "debug": false,
  "batchSize": 32,
  "alignOutput": false,
  "initialPrompt": "Japanese Submarine"
}
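Before sending a request, it can help to validate the payload on the client side. The field names below come from the input schema above; the specific checks are illustrative assumptions, not part of the service's API:

```python
def validate_payload(p):
    """Lightweight client-side checks for the WhisperX action input."""
    errors = []
    if not isinstance(p.get("audio"), str) or not p["audio"].startswith(("http://", "https://")):
        errors.append("audio must be an HTTP(S) URI string")
    if "debug" in p and not isinstance(p["debug"], bool):
        errors.append("debug must be a boolean")
    if "batchSize" in p and (not isinstance(p["batchSize"], int) or p["batchSize"] < 1):
        errors.append("batchSize must be a positive integer")
    if "alignOutput" in p and not isinstance(p["alignOutput"], bool):
        errors.append("alignOutput must be a boolean")
    if "initialPrompt" in p and not isinstance(p["initialPrompt"], str):
        errors.append("initialPrompt must be a string")
    return errors

payload = {
    "audio": "https://example.com/interview.mp3",  # placeholder URL
    "debug": False,
    "batchSize": 32,
    "alignOutput": False,
    "initialPrompt": "Japanese Submarine",
}
print(validate_payload(payload))  # an empty list means the payload is well-formed
```
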
Expected Output
The output will be a structured response containing transcriptions of the audio, with each segment providing the start and end times as well as the transcribed text. This allows developers to align text with audio playback accurately.
Example Output:
[
  {
    "start": 1.122,
    "end": 30.822,
    "text": " A Japanese submarine slammed two torpedoes into our side, Chief..."
  },
  {
    "start": 31.784,
    "end": 60.1,
    "text": " it held by looking from the dorsal to the tail..."
  },
  ...
]
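Because each segment carries start and end times in seconds, the output maps directly onto subtitle formats. A minimal sketch that converts segments of this shape into SRT (the segment data below is taken from the example output above):

```python
def to_srt(segments):
    """Convert segments with start/end times (seconds) and text into SRT subtitles."""
    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = round((s - int(s)) * 1000)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{fmt(seg['start'])} --> {fmt(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates subtitle entries
    return "\n".join(lines)

segments = [
    {"start": 1.122, "end": 30.822,
     "text": " A Japanese submarine slammed two torpedoes into our side, Chief..."},
]
print(to_srt(segments))
```
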
Use Cases for this Specific Action
- Media Production: Transcribing interviews or discussions for easier content creation and editing.
- Education: Generating transcripts of lectures or discussions to improve accessibility for students.
- Research: Converting audio data from interviews or focus groups into text for analysis.
- Entertainment: Creating subtitles for films or videos, improving viewer engagement.
Example Implementation (Python):

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "41242485-8226-4855-bc97-42edec1b0bbb"  # Action ID for: Execute WhisperX Transcription with Initial Prompt

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
    "audio": "https://replicate.delivery/pbxt/JP7lsBh30UlqIxXbmhqUVUH8BqdRBAvGm5TFaREzXPiRSCqE/Jaws%20%281975%29%20-%20The%20Indianapolis%20Speech%20Scene%20%287_10%29%20_%20Movieclips.mp3",
    "debug": False,
    "batchSize": 32,
    "alignOutput": False,
    "initialPrompt": "Japanese Submarine"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
print("------------------------------------------------")
Conclusion
The Cog Whisperx Withprompt service provides developers with a robust tool for audio transcription, enhancing accuracy through contextual prompts. Whether for media production, educational purposes, or research analysis, this action streamlines the transcription process, making it faster and more reliable. By integrating this service into your applications, you can significantly improve the accessibility and usability of audio content. Consider exploring more advanced features and additional Cognitive Actions to maximize the potential of your audio data.