Integrate Fast Speech Recognition into Your App with WhisperX Actions

Automatic Speech Recognition (ASR) has made significant strides in recent years, enabling developers to create applications that can transcribe and analyze spoken language with remarkable accuracy. The WhisperX API brings high-speed speech recognition to your applications, achieving up to 70x real-time transcription speed. This article will guide you through integrating WhisperX's Cognitive Actions, focusing specifically on performing fast speech recognition.
Introduction
WhisperX builds on OpenAI's Whisper model, offering a robust solution for speech-to-text tasks. It supports multiple languages and speaker diarization, and provides word-level timestamps, making it an ideal choice for applications that require precise audio transcription. By using pre-built actions, developers can quickly integrate sophisticated speech recognition into their applications without needing to dive into the complexities of machine learning models.
Prerequisites
Before you begin integrating WhisperX Cognitive Actions, ensure you have:
- An API key for the WhisperX service.
- Access to the internet for API calls.
- Basic knowledge of JSON and HTTP requests.
Authentication typically involves passing your API key in the headers of your requests.
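As a minimal sketch, assuming the service accepts a standard bearer token (the exact header scheme may differ for your account, so treat this as an assumption to verify against your service's documentation), the headers for each request can be built like this:

```python
# Sketch: build the HTTP headers for a bearer-token JSON API.
# The "Authorization: Bearer ..." scheme is an assumption; check your docs.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, replace with your key


def build_headers(api_key: str) -> dict:
    """Return headers for an authenticated JSON request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


print(build_headers(API_KEY))
```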
Cognitive Actions Overview
Perform Fast Speech Recognition with WhisperX
Description:
This action allows you to utilize WhisperX for high-speed automatic speech recognition. It provides features such as word-level timestamps and supports speaker diarization, which can be particularly helpful in applications involving multiple speakers.
Category: Speech-to-Text
Input: The input schema for this action consists of several properties:
{
  "type": "object",
  "properties": {
    "task": {
      "type": "string",
      "default": "transcribe",
      "description": "Specifies the task to perform: either transcribe or translate."
    },
    "debug": {
      "type": "boolean",
      "default": false,
      "description": "Enable this for detailed debugging information."
    },
    "diarize": {
      "type": "boolean",
      "default": false,
      "description": "Set to true to perform speaker diarization in the result."
    },
    "audioUrl": {
      "type": "string",
      "description": "The URL of the audio file."
    },
    "language": {
      "type": "string",
      "description": "The original language of the audio, specified to reduce transcription errors."
    },
    "onlyText": {
      "type": "boolean",
      "default": false,
      "description": "Return only the transcribed text."
    },
    "audioFile": {
      "type": "string",
      "description": "Local URI or path of the audio file."
    },
    "batchSize": {
      "type": "integer",
      "default": 32,
      "description": "Determines the number of audio segments processed in parallel."
    },
    "alignOutput": {
      "type": "boolean",
      "default": false,
      "description": "Enable this to include word-level timing information."
    },
    "fileExtension": {
      "type": "string",
      "default": ".wav",
      "description": "The file extension of the audio file."
    }
  }
}
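Since most properties carry defaults, a caller only needs to supply the audio source. The helper below is a sketch, not part of the API: it mirrors the defaults listed in the schema above and fills in any properties the caller omits before the payload is sent.

```python
# Defaults copied from the input schema above. audioUrl, audioFile, and
# language have no defaults and must be supplied by the caller.
SCHEMA_DEFAULTS = {
    "task": "transcribe",
    "debug": False,
    "diarize": False,
    "onlyText": False,
    "batchSize": 32,
    "alignOutput": False,
    "fileExtension": ".wav",
}


def with_defaults(payload: dict) -> dict:
    """Return a copy of payload with omitted properties set to their schema defaults."""
    merged = dict(SCHEMA_DEFAULTS)
    merged.update(payload)
    return merged


# Caller only sets what differs from the defaults:
print(with_defaults({"audioUrl": "https://example.com/clip.wav", "diarize": True}))
```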
Example Input: Here's a practical example of a JSON payload you would send to invoke the action:
{
  "task": "transcribe",
  "debug": false,
  "diarize": true,
  "onlyText": false,
  "audioFile": "https://replicate.delivery/pbxt/JhUhwsTpsFNXb9URfYaU7keZ1ncFhbCfQvsBsr98QwzvxrFm/video%20%2880%29.mp4",
  "batchSize": 32,
  "alignOutput": true,
  "fileExtension": ".wav"
}
Output: The action typically returns a structured response containing the transcribed text along with additional metadata, such as timestamps and speaker identifiers. Here's an example of the expected output:
{
  "language": "en",
  "segments": [
    {
      "start": 0.068,
      "end": 2.63,
      "text": "It's always interesting you look sometimes at catcher E.R.A.",
      "speaker": "SPEAKER_00",
      "words": [
        {"start": 0.068, "end": 0.168, "word": "It's"},
        {"start": 0.188, "end": 0.368, "word": "always"},
        {"start": 0.388, "end": 0.688, "word": "interesting", "speaker": "SPEAKER_00"},
        // additional word segments...
      ]
    },
    // additional segments...
  ]
}
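Once the response arrives, the segment list is straightforward to post-process. The sketch below assumes a response shaped like the sample output above (field names are taken from that sample) and folds the segments into a speaker-labelled transcript, which is especially useful when diarize is enabled:

```python
# Sketch: turn a WhisperX-style response into a speaker-labelled transcript.
# Field names ("segments", "speaker", "text") follow the sample output above.
def speaker_transcript(result: dict) -> str:
    """Return one line per segment, prefixed with its speaker label."""
    lines = []
    for seg in result.get("segments", []):
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{speaker}] {seg['text'].strip()}")
    return "\n".join(lines)


sample = {
    "language": "en",
    "segments": [
        {
            "start": 0.068,
            "end": 2.63,
            "text": "It's always interesting you look sometimes at catcher E.R.A.",
            "speaker": "SPEAKER_00",
        },
    ],
}
print(speaker_transcript(sample))
```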
Conceptual Usage Example (Python): Here's how a developer might call the WhisperX action. This example demonstrates the structure of a request to the Cognitive Actions API:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

# Action ID for Perform Fast Speech Recognition with WhisperX
action_id = "d9924241-a2dd-4435-86ad-9c1679e77edc"

# Construct the input payload based on the action's requirements
payload = {
    "task": "transcribe",
    "debug": False,
    "diarize": True,
    "onlyText": False,
    "audioFile": "https://replicate.delivery/pbxt/JhUhwsTpsFNXb9URfYaU7keZ1ncFhbCfQvsBsr98QwzvxrFm/video%20%2880%29.mp4",
    "batchSize": 32,
    "alignOutput": True,
    "fileExtension": ".wav",
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code snippet:
- Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
- The payload is structured according to the input schema defined for the action.
- The response is printed as formatted JSON for easy viewing.
Conclusion
Integrating the WhisperX Cognitive Action for fast speech recognition can significantly enhance the capabilities of your application. With support for multiple languages, speaker diarization, and high-speed processing, WhisperX equips developers with the tools to create innovative and efficient speech-related applications.
Consider exploring further use cases such as real-time transcription services, multilingual support for global applications, or interactive voice response systems to leverage the full potential of WhisperX. Happy coding!