Rapid Speech-to-Text Transcription with Incredibly Fast Whisper Actions

In the fast-paced world of application development, integrating advanced functionality can significantly enhance the user experience. The Incredibly Fast Whisper API provides a powerful Cognitive Action for speech-to-text transcription, allowing developers to transcribe audio files rapidly, with optional speaker diarization. Built on the Insanely Fast Whisper implementation of the Whisper Large v3 model and Hugging Face Transformers, this action can transcribe up to 150 minutes of audio in under 2 minutes on capable GPU hardware.
Prerequisites
Before diving into the integration of the Cognitive Actions, ensure you have the following:
- API Key: You will need an API key for accessing the Cognitive Actions platform.
- Setup: Familiarity with making HTTP requests and handling JSON data will be beneficial.
Authentication typically involves including your API key in the request headers, ensuring secure access to the service.
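As a minimal sketch of that authentication pattern (the Bearer-token scheme is an assumption here; confirm the exact scheme in your platform's documentation), the request headers might be built like this:

```python
# Hypothetical: the exact auth scheme depends on the Cognitive Actions platform.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

def build_headers(api_key: str) -> dict:
    """Build JSON request headers with a Bearer token (assumed scheme)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

print(build_headers(API_KEY)["Authorization"])
```

The same headers dictionary is reused in the fuller Python example later in this article.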
Cognitive Actions Overview
Perform Speech-to-Text Transcription with Diarization
This action enables rapid transcription of audio files while distinguishing between different speakers (diarization), making it ideal for applications in meetings, interviews, and multi-speaker scenarios.
- Category: Speech-to-Text
Input
The input schema requires the following fields:
- audio (string, required): The URI of the audio file to process.
- task (string, optional): Specifies whether to transcribe or translate the audio. Default is "transcribe".
- language (string, optional): The spoken language in the audio. Automatic detection is enabled if set to "None".
- batchSize (integer, optional): Number of parallel batches to compute at once; reduce this if you encounter out-of-memory errors. Default is 24.
- timestamp (string, optional): Determines the level of timestamps, either "chunk" or "word". Default is "chunk".
- diariseAudio (boolean, optional): Enables audio diarization when set to true. Default is false.
- maximumSpeakers (integer, optional): Maximum speakers to identify in the audio.
- minimumSpeakers (integer, optional): Minimum speakers to identify in the audio.
- huggingfaceToken (string, optional): Token for using Pyannote.audio in diarization.
- numberOfSpeakers (integer, optional): Exact number of speakers if known.
Example Input:
{
  "task": "transcribe",
  "audio": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav",
  "language": "None",
  "batchSize": 24,
  "timestamp": "chunk",
  "diariseAudio": false
}
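To enable diarization, the optional fields listed above come into play. The following sketch (field names taken from the schema; the audio URI and Hugging Face token are placeholders) builds such a payload and performs a few consistency checks:

```python
# Sketch: build a diarization-enabled payload from the optional schema fields.
# The audio URI and Hugging Face token below are placeholders, not working values.
diarization_payload = {
    "task": "transcribe",
    "audio": "https://example.com/meeting.wav",  # placeholder audio URI
    "timestamp": "word",           # word-level timestamps
    "diariseAudio": True,          # turn diarization on
    "minimumSpeakers": 2,
    "maximumSpeakers": 4,
    "huggingfaceToken": "YOUR_HF_TOKEN",
}

def validate_payload(p: dict) -> list:
    """Return a list of problems; an empty list means the payload looks consistent."""
    problems = []
    if "audio" not in p:
        problems.append("audio is required")
    if p.get("diariseAudio") and "huggingfaceToken" not in p:
        problems.append("diarization needs a huggingfaceToken")
    lo, hi = p.get("minimumSpeakers"), p.get("maximumSpeakers")
    if lo is not None and hi is not None and lo > hi:
        problems.append("minimumSpeakers exceeds maximumSpeakers")
    return problems

print(validate_payload(diarization_payload))  # → []
```

Checking the payload locally like this catches schema mistakes before you spend a round trip on a failing request.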
Output
The action typically returns the transcribed text along with an array of text chunks, each associated with timestamps.
Example Output:
{
  "text": "the little tales they tell are false...",
  "chunks": [
    {
      "text": "the little tales they tell are false...",
      "timestamp": [0, 29.72]
    },
    {
      "text": "with a mild wab...",
      "timestamp": [29.72, 38.98]
    },
    {
      "text": "honour. She blushed...",
      "timestamp": [38.98, 48.52]
    }
  ]
}
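The chunks array is straightforward to post-process. As an illustrative sketch using the example output above, the following formats each chunk as a timestamped line:

```python
# Format transcription chunks as "[start-end] text" lines.
chunks = [
    {"text": "the little tales they tell are false...", "timestamp": [0, 29.72]},
    {"text": "with a mild wab...", "timestamp": [29.72, 38.98]},
    {"text": "honour. She blushed...", "timestamp": [38.98, 48.52]},
]

def format_chunks(chunks: list) -> list:
    """Render each chunk as a '[start-end] text' string with two-decimal times."""
    lines = []
    for c in chunks:
        start, end = c["timestamp"]
        lines.append(f"[{start:.2f}-{end:.2f}] {c['text']}")
    return lines

for line in format_chunks(chunks):
    print(line)
```

The same loop is a natural starting point for generating subtitle formats such as SRT, where each chunk becomes one cue.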
Conceptual Usage Example (Python)
Here’s how you might call this action using a conceptual Python snippet:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "f3f2beeb-8ff2-400a-92b0-37059c602980"  # Action ID for Perform Speech-to-Text Transcription with Diarization

# Construct the input payload based on the action's input schema
payload = {
    "task": "transcribe",
    "audio": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav",
    "language": "None",
    "batchSize": 24,
    "timestamp": "chunk",
    "diariseAudio": False
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical request structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code snippet, replace the placeholders with your actual API key and confirm the endpoint and request structure against your platform's documentation. The input payload is structured according to the schema defined above.
Conclusion
The Incredibly Fast Whisper Cognitive Action for speech-to-text transcription offers developers a simple yet powerful tool to enhance their applications with audio processing capabilities. With rapid transcription, speaker diarization, and easy integration, it's an invaluable resource for any developer looking to implement advanced audio functionalities. Consider exploring further use cases, such as real-time transcription or integrating with chat applications, to fully leverage the potential of this action.