Enhance Audio Transcription Accuracy Using Whisper Word Timestamps

In the world of audio processing, achieving precise transcription can be a daunting task. The collectiveai-team/whisper-wordtimestamps API provides developers with a powerful toolset to enhance word-level timestamp accuracy in audio transcripts using OpenAI's Whisper model. This set of Cognitive Actions offers pre-built capabilities that streamline the process of audio analysis, allowing developers to focus on building innovative applications rather than reinventing the wheel.
Prerequisites
To integrate the Whisper Word Timestamps actions into your application, you'll need:
- An API key for the Cognitive Actions platform, which you will use for authentication.
- Basic familiarity with making HTTP requests and handling JSON data.
In general, authentication can be accomplished by passing your API key in the headers of your requests.
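For instance, the request headers carrying your key might look like the sketch below. The `Bearer` scheme and header names are assumptions based on common API conventions; consult the platform documentation for the exact format.

```python
# Hypothetical auth headers for the Cognitive Actions platform.
# The Bearer scheme is an assumption based on common API conventions.
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}
```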
Cognitive Actions Overview
Enhance Word Timestamps with Whisper
This action enhances the accuracy of word-level timestamps in audio transcripts, providing precise synchronization of spoken words within the audio.
- Category: Audio Transcription
- Input: The input schema accepts several fields, most of which are optional. Key fields are described below, along with an example input JSON.
Input Schema
| Field | Type | Description |
|---|---|---|
| audio | string | URI of the audio file to be processed. |
| model | string | Select a Whisper model (options: tiny, base, small, medium, large-v1, large-v2). Default is "base". |
| audioUrl | string | URL pointing to the location of the audio file. |
| language | string | Language spoken in the audio. Specify 'None' for automatic language detection. |
| wordTimestamps | boolean | Enables extraction of word-level timestamps. Default is false. |
| temperature | number | Defines randomness in sampling. Lower values produce more deterministic outputs. Default is 0. |
| suppressTokens | string | Token IDs to suppress during sampling. Using '-1' suppresses most special characters. |
Example Input
```json
{
  "model": "base",
  "audioUrl": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3",
  "temperature": 0,
  "suppressTokens": "-1",
  "wordTimestamps": false,
  "logProbThreshold": -1,
  "noSpeechThreshold": 0.6,
  "appendPunctuations": "\"'.。,,!!??::”)]}、",
  "prependPunctuations": "\"'“¿([{-",
  "conditionOnPreviousText": true,
  "compressionRatioThreshold": 2.4,
  "temperatureIncrementOnFallback": 0.2
}
```
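Rather than hand-writing the full payload, you could build it programmatically, starting from the documented defaults and overriding only what you need. The sketch below uses default values taken from the schema table; the `build_payload` helper itself is hypothetical, not part of the API.

```python
# Defaults taken from the input schema table above.
DEFAULTS = {
    "model": "base",
    "temperature": 0,
    "suppressTokens": "-1",
    "wordTimestamps": False,
}

def build_payload(audio_url, **overrides):
    """Return an input payload for the action, applying schema defaults."""
    payload = {**DEFAULTS, "audioUrl": audio_url}
    payload.update(overrides)
    return payload

payload = build_payload(
    "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3",
    wordTimestamps=True,
    language="en",
)
```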
Output
The action typically returns an array of segments, each containing the text and associated timestamps. Below is an example output JSON.
```json
{
  "segments": [
    {
      "id": 0,
      "start": 0,
      "end": 6.2,
      "text": " What are some cool synthetic organisms...",
      "avg_logprob": -0.2868123466585889,
      "no_speech_prob": 0.0683993548154831,
      "compression_ratio": 1.7454545454545454
    },
    {
      "id": 1,
      "start": 6.72,
      "end": 13.48,
      "text": " What do you imagine what do you hope to build...",
      "avg_logprob": -0.2868123466585889,
      "no_speech_prob": 0.0683993548154831,
      "compression_ratio": 1.7454545454545454
    }
  ],
  "transcription": " What are some cool synthetic organisms that you think about...",
  "detected_language": "english"
}
```
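Once you have the segments, converting them into subtitle cues is a common next step. The sketch below formats segment boundaries as SRT-style timestamps; the segment fields match the example output above, while the formatting helper is our own, not part of the API response.

```python
def srt_timestamp(seconds):
    """Convert seconds to an SRT-style HH:MM:SS,mmm timestamp."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

# Segments as returned in the example output above.
segments = [
    {"id": 0, "start": 0, "end": 6.2, "text": " What are some cool synthetic organisms..."},
    {"id": 1, "start": 6.72, "end": 13.48, "text": " What do you imagine what do you hope to build..."},
]

# One SRT cue per segment: index, time range, then the spoken text.
cues = [
    f"{seg['id'] + 1}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text'].strip()}"
    for seg in segments
]
```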
Conceptual Usage Example (Python)
Here’s how you might call this action using a hypothetical Cognitive Actions execution endpoint with Python:
```python
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

# Action ID for Enhance Word Timestamps with Whisper
action_id = "10e7fc2f-bc74-4c54-b7c4-10a81f9af2dc"

# Construct the input payload based on the action's requirements
payload = {
    "model": "base",
    "audioUrl": "https://replicate.delivery/pbxt/IZjTvet2ZGiyiYaMEEPrzn0xY1UDNsh0NfcO9qeTlpwCo7ig/lex-levin-4min.mp3",
    "temperature": 0,
    "suppressTokens": "-1",
    "wordTimestamps": False,
    "logProbThreshold": -1,
    "noSpeechThreshold": 0.6,
    "appendPunctuations": "\"'.。,,!!??::”)]}、",
    "prependPunctuations": "\"'“¿([{-",
    "conditionOnPreviousText": True,
    "compressionRatioThreshold": 2.4,
    "temperatureIncrementOnFallback": 0.2,
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
```
In this code snippet, replace the placeholder for your API key and ensure the endpoint URL is accurate. The input payload is structured as per the requirements, and the response will provide the enhanced timestamps along with the transcription.
Conclusion
The Enhance Word Timestamps with Whisper action empowers developers to produce more accurate audio transcriptions, enhancing user experiences in applications that rely on precise audio data. Exploring this action can lead to various exciting applications, from content creation to accessibility solutions. As you integrate this capability, consider experimenting with different models and parameters to optimize performance for your specific use cases. Happy coding!